Preface


Probability is common sense reduced to calculation 


Laplace 


This book is an outgrowth of our involvement in teaching an introductory
probability course (“Probabilistic Systems Analysis”) at the Massachusetts Institute
of Technology.

The course is attended by a large number of students with diverse back- 
grounds, and a broad range of interests. They span the entire spectrum from 
freshmen to beginning graduate students, and from the engineering school to the 
school of management. Accordingly, we have tried to strike a balance between 
simplicity in exposition and sophistication in analytical reasoning. Our key aim 
has been to develop the ability to construct and analyze probabilistic models in 
a manner that combines intuitive understanding and mathematical precision. 

In this spirit, some of the more mathematically rigorous analysis has been
just sketched or intuitively explained in the text, so that complex proofs do not
stand in the way of an otherwise simple exposition. At the same time, some of
this analysis is developed (at the level of advanced calculus) in theoretical
problems that are included at the end of the corresponding chapter. Furthermore,
some of the subtler mathematical issues are hinted at in footnotes addressed to
the more attentive reader.

The book covers the fundamentals of probability theory (probabilistic
models, discrete and continuous random variables, multiple random variables, and
limit theorems), which are typically part of a first course on the subject. It
also contains, in Chapters 4-6, a number of more advanced topics, from which an
instructor can choose to match the goals of a particular course. In particular, in
Chapter 4, we develop transforms, a more advanced view of conditioning, sums
of random variables, least squares estimation, and the bivariate normal
distribution. Furthermore, in Chapters 5 and 6, we provide a fairly detailed introduction
to Bernoulli, Poisson, and Markov processes.

Our M.I.T. course covers all seven chapters in a single semester, with the ex- 
ception of the material on the bivariate normal (Section 4.7), and on continuous- 
time Markov chains (Section 6.5). However, in an alternative course, the material 
on stochastic processes could be omitted, thereby allowing additional emphasis 
on foundational material, or coverage of other topics of the instructor's choice. 

Our most notable omission in coverage is an introduction to statistics. 
While we develop all the basic elements of Bayesian statistics, in the form of 
Bayes' rule for discrete and continuous models, and least squares estimation, we 
do not enter the subjects of parameter estimation, or non-Bayesian hypothesis 
testing. 

The problems that supplement the main text are divided into three categories:


(a) Theoretical problems: The theoretical problems (marked by *) constitute 
an important component of the text, and ensure that the mathematically 
oriented reader will find here a smooth development without major gaps. 
Their solutions are given in the text, but an ambitious reader may be able 
to solve many of them, especially in earlier chapters, before looking at the 
solutions. 


(b) Problems in the text: Besides theoretical problems, the text contains several
problems, of various levels of difficulty. These are representative of the 
problems that are usually covered in recitation and tutorial sessions at 
M.I.T., and are a primary mechanism through which many of our students 
learn the material. Our hope is that students elsewhere will attempt to 
solve these problems, and then refer to their solutions to calibrate and 
enhance their understanding of the material. The solutions are posted on 
the book's www site 


http://www.athenasc.com/probbook.html


(c) Supplementary problems: There is a large (and growing) collection of ad- 
ditional problems, which is not included in the book, but is made available 
at the book's www site. Many of these problems have been assigned as 
homework or exam problems at M.I.T., and we expect that instructors 
elsewhere will use them for a similar purpose. While the statements of 
these additional problems are publicly accessible, the solutions are made 
available from the authors only to course instructors. 


We would like to acknowledge our debt to several people who contributed 
in various ways to the book. Our writing project began when we assumed re- 
sponsibility for a popular probability class at M.I.T. that our colleague Al Drake 
had taught for several decades. We were thus fortunate to start with an organi- 
zation of the subject that had stood the test of time, a lively presentation of the 
various topics in Al's classic textbook, and a rich set of material that had been 
used in recitation sessions and for homework. We are thus indebted to Al Drake
for providing a very favorable set of initial conditions.

We are thankful to the several colleagues who have either taught from the 
draft of the book at various universities or have read it, and have provided us 
with valuable feedback. In particular, we thank Ibrahim Abou Faycal, Gustavo 
de Veciana, Eugene Feinberg, Bob Gray, Muriel Médard, Jason Papastavrou,
Ilya Pollak, David Tse, and Terry Wagner. 

The teaching assistants for the M.I.T. class have been very helpful. They 
pointed out corrections to various drafts, they developed problems and solutions 
suitable for the class, and through their direct interaction with the student body, 
they provided a robust mechanism for calibrating the level of the material. 

Reaching thousands of bright students at M.I.T. at an early stage in their 
studies was a great source of satisfaction for us. We thank them for their valu- 
able feedback and for being patient while they were taught from a textbook-in- 
progress. 

Last but not least, we are grateful to our families for their support through- 
out the course of this long project. 


Dimitri P. Bertsekas, dimitrib@mit.edu
John N. Tsitsiklis, jnt@mit.edu


Cambridge, Mass., May 2002


Preface to the Second Edition


This is a substantial revision of the 1st edition, involving a reorganization of old 
material and the addition of new material. The length of the book has increased 
by about 25 percent. The main changes are the following: 


(a) Two new chapters on statistical inference have been added, one on Bayesian
and one on classical methods. Our philosophy has been to focus on the 
main concepts and to facilitate understanding of the main methodologies 
through some key examples. 


(b) Chapters 3 and 4 have been revised, in part to accommodate the new
material of the inference chapters and in part to streamline the presentation.
Section 4.7 of the 1st edition (bivariate normal distribution) has been
omitted from the new edition, but is available at the book's website.


(c) A number of new examples and end-of-chapter problems have been added. 


The main objective of the new edition is to provide flexibility to instructors 
in their choice of material, and in particular to give them the option of including 
an introduction to statistical inference. Note that Chapters 6-7, and Chapters 8- 
9 are mutually independent, thus allowing for different paths through the book. 
Furthermore, Chapter 4 is not needed for Chapters 5-7, and only Sections 4.2-4.3 
from Chapter 4 are needed for Chapters 8 and 9. Thus, some possible course 
offerings based on this book are: 


(a) Probability and introduction to statistical inference: Chapters 1-3, Sections 
4.2-4.3, Chapter 5, Chapters 8-9. 


(b) Probability and introduction to stochastic processes: Chapters 1-3 and 5-7, 
with possibly a few sections from Chapter 4. 


We would like to express our thanks to various colleagues who have con- 
tributed valuable comments on the material in the 1st edition and/or the or- 
ganization of the material in the new chapters. Ed Coffman, Munther Dahleh, 
Vivek Goyal, Anant Sahai, David Tse, George Verghese, Alan Willsky, and John 
Wyatt have been very helpful in this regard. Finally, we thank Mengdi Wang 
for her help with figures and problems for the new chapters. 


Dimitri P. Bertsekas, dimitrib@mit.edu 
John N. Tsitsiklis, jnt@mit.edu


Cambridge, Mass., June 2008


2 Sample Space and Probability Chap. 1


"Probability" is a very useful concept, but can be interpreted in a number of 
ways. As an illustration, consider the following. 


A patient is admitted to the hospital and a potentially life-saving drug is 
administered. The following dialog takes place between the nurse and a 
concerned relative. 


RELATIVE: Nurse, what is the probability that the drug will work? 
NURSE: I hope it works, we'll know tomorrow. 

RELATIVE: Yes, but what is the probability that it will? 

NURSE: Each case is different, we have to wait. 

RELATIVE: But let's see, out of a hundred patients that are treated under 
similar conditions, how many times would you expect it to work? 

NURSE (somewhat annoyed): I told you, every person is different, for some 
it works, for some it doesn't. 

RELATIVE (insisting): Then tell me, if you had to bet whether it will work 
or not, which side of the bet would you take? 

NURSE (cheering up for a moment): I'd bet it will work. 

RELATIVE (somewhat relieved): OK, now, would you be willing to lose two 
dollars if it doesn't work, and gain one dollar if it does? 

NURSE (exasperated): What a sick thought! You are wasting my time! 





In this conversation, the relative attempts to use the concept of probability 
to discuss an uncertain situation. The nurse's initial response indicates that the 
meaning of “probability” is not uniformly shared or understood, and the relative 
tries to make it more concrete. The first approach is to define probability in 
terms of frequency of occurrence, as a percentage of successes in a moderately 
large number of similar situations. Such an interpretation is often natural. For 
example, when we say that a perfectly manufactured coin lands on heads “with
probability 50%,” we typically mean “roughly half of the time.” But the nurse
may not be entirely wrong in refusing to discuss in such terms. What if this 
was an experimental drug that was administered for the very first time in this 
hospital or in the nurse's experience? 

While there are many situations involving uncertainty in which the fre- 
quency interpretation is appropriate, there are other situations in which it is 
not. Consider, for example, a scholar who asserts that the Iliad and the Odyssey
were composed by the same person, with probability 90%. Such an assertion
conveys some information, but not in terms of frequencies, since the subject is 
a one-time event. Rather, it is an expression of the scholar's subjective be- 
lief. One might think that subjective beliefs are not interesting, at least from a 
mathematical or scientific point of view. On the other hand, people often have 
to make choices in the presence of uncertainty, and a systematic way of making 
use of their beliefs is a prerequisite for successful, or at least consistent, decision 
making. 


In fact, the choices and actions of a rational person can reveal a lot about
the inner-held subjective probabilities, even if the person does not make conscious
use of probabilistic reasoning. Indeed, the last part of the earlier dialog was an
attempt to infer the nurse's beliefs in an indirect manner. Since the nurse was
willing to accept a one-for-one bet that the drug would work, we may infer
that the probability of success was judged to be at least 50%. Had the nurse
accepted the last proposed bet (two-for-one), this would have indicated a success
probability of at least 2/3.
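
The thresholds quoted above can be checked with a short calculation. The sketch below assumes, as an illustration only, that a bet is acceptable exactly when its expected gain is nonnegative; the function name and this acceptance rule are not part of the text.

```python
from fractions import Fraction

def break_even_probability(loss_if_fail, gain_if_success):
    """Smallest success probability p for which the bet is not unfavorable.

    Assumed rule: accept when p * gain - (1 - p) * loss >= 0,
    which rearranges to p >= loss / (loss + gain).
    """
    loss = Fraction(loss_if_fail)
    gain = Fraction(gain_if_success)
    return loss / (loss + gain)

one_for_one = break_even_probability(1, 1)   # lose $1 or gain $1
two_for_one = break_even_probability(2, 1)   # lose $2 or gain $1
print(one_for_one, two_for_one)
```

Under this rule, the one-for-one bet corresponds to a threshold of 1/2 and the two-for-one bet to a threshold of 2/3, in agreement with the values stated in the paragraph.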

Rather than dwelling further on philosophical issues about the appropriate- 
ness of probabilistic reasoning, we will simply take it as a given that the theory 
of probability is useful in a broad variety of contexts, including some where the 
assumed probabilities only reflect subjective beliefs. There is a large body of 
successful applications in science, engineering, medicine, management, etc., and 
on the basis of this empirical evidence, probability theory is an extremely useful 
tool. 

Our main objective in this book is to develop the art of describing un- 
certainty in terms of probabilistic models, as well as the skill of probabilistic 
reasoning. The first step, which is the subject of this chapter, is to describe 
the generic structure of such models and their basic properties. The models we 
consider assign probabilities to collections (sets) of possible outcomes. For this 
reason, we must begin with a short review of set theory. 


1.1 SETS


Probability makes extensive use of set operations, so let us introduce at the 
outset the relevant notation and terminology. 

A set is a collection of objects, which are the elements of the set. If S is
a set and x is an element of S, we write $x \in S$. If x is not an element of S, we
write $x \notin S$. A set can have no elements, in which case it is called the empty
set, denoted by $\emptyset$.

Sets can be specified in a variety of ways. If S contains a finite number of
elements, say $x_1, x_2, \ldots, x_n$, we write it as a list of the elements, in braces:


$$S = \{x_1, x_2, \ldots, x_n\}.$$


For example, the set of possible outcomes of a die roll is {1, 2, 3, 4, 5, 6}, and the
set of possible outcomes of a coin toss is {H, T}, where H stands for “heads”
and T stands for “tails.”

If S contains infinitely many elements $x_1, x_2, \ldots$, which can be enumerated
in a list (so that there are as many elements as there are positive integers) we
write


$$S = \{x_1, x_2, \ldots\},$$


and we say that S is countably infinite. For example, the set of even integers
can be written as $\{0, 2, -2, 4, -4, \ldots\}$, and is countably infinite.

Alternatively, we can consider the set of all x that have a certain property
P, and denote it by


$$\{x \mid x \text{ satisfies } P\}.$$


(The symbol “|” is to be read as “such that.”) For example, the set of even
integers can be written as $\{k \mid k/2 \text{ is integer}\}$. Similarly, the set of all scalars x
in the interval [0, 1] can be written as $\{x \mid 0 \le x \le 1\}$. Note that the elements x
of the latter set take a continuous range of values, and cannot be written down
in a list (a proof is sketched in the end-of-chapter problems); such a set is said
to be uncountable.

If every element of a set S is also an element of a set T, we say that S
is a subset of T, and we write $S \subset T$ or $T \supset S$. If $S \subset T$ and $T \subset S$, the
two sets are equal, and we write $S = T$. It is also expedient to introduce a
universal set, denoted by $\Omega$, which contains all objects that could conceivably
be of interest in a particular context. Having specified the context in terms of a
universal set $\Omega$, we only consider sets S that are subsets of $\Omega$.


Set Operations 


The complement of a set S, with respect to the universe $\Omega$, is the set
$\{x \in \Omega \mid x \notin S\}$ of all elements of $\Omega$ that do not belong to S, and is denoted by $S^c$.
Note that $\Omega^c = \emptyset$.

The union of two sets S and T is the set of all elements that belong to S
or T (or both), and is denoted by $S \cup T$. The intersection of two sets S and T
is the set of all elements that belong to both S and T, and is denoted by $S \cap T$.
Thus,

$$S \cup T = \{x \mid x \in S \text{ or } x \in T\},$$

and

$$S \cap T = \{x \mid x \in S \text{ and } x \in T\}.$$


In some cases, we will have to consider the union or the intersection of several,
even infinitely many sets, defined in the obvious way. For example, if for every
positive integer n, we are given a set $S_n$, then

$$\bigcup_{n=1}^{\infty} S_n = S_1 \cup S_2 \cup \cdots = \{x \mid x \in S_n \text{ for some } n\},$$

and

$$\bigcap_{n=1}^{\infty} S_n = S_1 \cap S_2 \cap \cdots = \{x \mid x \in S_n \text{ for all } n\}.$$

Two sets are said to be disjoint if their intersection is empty. More generally,
several sets are said to be disjoint if no two of them have a common element. A
collection of sets is said to be a partition of a set S if the sets in the collection
are disjoint and their union is S.

If x and y are two objects, we use (x, y) to denote the ordered pair of x
and y. The set of scalars (real numbers) is denoted by $\Re$; the set of pairs (or
triplets) of scalars, i.e., the two-dimensional plane (or three-dimensional space,
respectively) is denoted by $\Re^2$ (or $\Re^3$, respectively).

Sets and the associated operations are easy to visualize in terms of Venn
diagrams, as illustrated in Fig. 1.1.














Figure 1.1: Examples of Venn diagrams. (a) The shaded region is $S \cap T$. (b)
The shaded region is $S \cup T$. (c) The shaded region is $S \cap T^c$. (d) Here, $T \subset S$.
The shaded region is the complement of S. (e) The sets S, T, and U are disjoint.
(f) The sets S, T, and U form a partition of the set $\Omega$.
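
For finite sets, the operations defined in this subsection can be tried out directly. The following is a minimal sketch using Python's built-in set type; the particular universal set, the sets S and T, and the helper names `complement` and `is_partition` are illustrative choices, not notation from the text.

```python
# Universal set (an assumption for illustration): outcomes of a six-sided die roll.
omega = {1, 2, 3, 4, 5, 6}
S = {1, 2, 3}
T = {3, 4}

def complement(A, universe):
    """Elements of the universe that do not belong to A."""
    return universe - A

def is_partition(collection, target):
    """True if the sets are pairwise disjoint and their union is the target set."""
    sets = list(collection)
    pairwise_disjoint = all(
        sets[i].isdisjoint(sets[j])
        for i in range(len(sets))
        for j in range(i + 1, len(sets))
    )
    union = set().union(*sets)
    return pairwise_disjoint and union == target

print(S | T)                  # union of S and T
print(S & T)                  # intersection of S and T
print(complement(S, omega))   # complement of S with respect to omega
print(is_partition([{1, 2}, {3, 4}, {5, 6}], omega))
print(is_partition([S, T], omega))
```

Here S and T share the element 3, so they are not disjoint and cannot form a partition, whereas the three two-element sets do partition the universal set.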


The Algebra of Sets 


Set operations have several properties, which are elementary consequences of the 
definitions. Some examples are: 


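$$S \cup T = T \cup S, \qquad S \cup (T \cup U) = (S \cup T) \cup U,$$
$$S \cap (T \cup U) = (S \cap T) \cup (S \cap U), \qquad S \cup (T \cap U) = (S \cup T) \cap (S \cup U),$$
$$(S^c)^c = S, \qquad S \cap S^c = \emptyset,$$
$$S \cup \Omega = \Omega, \qquad S \cap \Omega = S.$$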


Two particularly useful properties are given by De Morgan's laws, which
state that

$$\Big(\bigcup_n S_n\Big)^c = \bigcap_n S_n^c, \qquad \Big(\bigcap_n S_n\Big)^c = \bigcup_n S_n^c.$$

To establish the first law, suppose that $x \in (\cup_n S_n)^c$. Then, $x \notin \cup_n S_n$, which
implies that for every n, we have $x \notin S_n$. Thus, x belongs to the complement
of every $S_n$, and $x \in \cap_n S_n^c$. This shows that $(\cup_n S_n)^c \subset \cap_n S_n^c$. The converse
inclusion is established by reversing the above argument, and the first law follows.
The argument for the second law is similar.
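
A numerical check is no substitute for the proof, but it can make the two laws concrete. The sketch below verifies both laws for one small, arbitrarily chosen family of sets; the universal set and the three sets are assumptions made for illustration.

```python
from functools import reduce

omega = set(range(10))                       # universal set (illustrative)
sets = [{0, 1, 2}, {2, 3, 4}, {2, 4, 6, 8}]  # S_1, S_2, S_3 (illustrative)

def complement(A):
    """Complement of A with respect to omega."""
    return omega - A

union_all = reduce(set.union, sets)
intersection_all = reduce(set.intersection, sets)

# First law: the complement of the union equals the intersection of the complements.
first_law = complement(union_all) == reduce(set.intersection, map(complement, sets))
# Second law: the complement of the intersection equals the union of the complements.
second_law = complement(intersection_all) == reduce(set.union, map(complement, sets))
print(first_law, second_law)
```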


1.2 PROBABILISTIC MODELS 


A probabilistic model is a mathematical description of an uncertain situation. 
It must be in accordance with a fundamental framework that we discuss in this 
section. Its two main ingredients are listed below and are visualized in Fig. 1.2. 


Elements of a Probabilistic Model 


• The sample space $\Omega$, which is the set of all possible outcomes of an
experiment.


• The probability law, which assigns to a set A of possible outcomes
(also called an event) a nonnegative number P(A) (called the probability
of A) that encodes our knowledge or belief about the collective
“likelihood” of the elements of A. The probability law must satisfy
certain properties to be introduced shortly.








[Figure 1.2 diagram; labels: Experiment, Sample space (set of possible outcomes), Events, Probability]


Figure 1.2: The main ingredients of a probabilistic model. 


Sample Spaces and Events 


Every probabilistic model involves an underlying process, called the
experiment, that will produce exactly one out of several possible outcomes. The set
of all possible outcomes is called the sample space of the experiment, and is
denoted by $\Omega$. A subset of the sample space, that is, a collection of possible
outcomes, is called an event.† There is no restriction on what constitutes an
experiment. For example, it could be a single toss of a coin, or three tosses,
or an infinite sequence of tosses. However, it is important to note that in our
formulation of a probabilistic model, there is only one experiment. So, three
tosses of a coin constitute a single experiment, rather than three experiments.

The sample space of an experiment may consist of a finite or an infinite 
number of possible outcomes. Finite sample spaces are conceptually and math- 
ematically simpler. Still, sample spaces with an infinite number of elements are 
quite common. As an example, consider throwing a dart on a square target and 
viewing the point of impact as the outcome. 


Choosing an Appropriate Sample Space 


Regardless of their number, different elements of the sample space should be
distinct and mutually exclusive, so that when the experiment is carried out
there is a unique outcome. For example, the sample space associated with the
roll of a die cannot contain “1 or 3” as a possible outcome and also “1 or 4”
as another possible outcome. If it did, we would not be able to assign a unique
outcome when the roll is a 1.

A given physical situation may be modeled in several different ways, de- 
pending on the kind of questions that we are interested in. Generally, the sample 
space chosen for a probabilistic model must be collectively exhaustive, in the 
sense that no matter what happens in the experiment, we always obtain an out- 
come that has been included in the sample space. In addition, the sample space 
should have enough detail to distinguish between all outcomes of interest to the 
modeler, while avoiding irrelevant details. 


Example 1.1. Consider two alternative games, both involving ten successive coin 
tosses: 


Game 1: We receive $1 each time a head comes up. 


Game 2: We receive $1 for every coin toss, up to and including the first time
a head comes up. Then, we receive $2 for every coin toss, up to the second
time a head comes up. More generally, the dollar amount per toss is doubled
each time a head comes up.


† Any collection of possible outcomes, including the entire sample space $\Omega$ and
its complement, the empty set $\emptyset$, may qualify as an event. Strictly speaking, however,
some sets have to be excluded. In particular, when dealing with probabilistic models
involving an uncountably infinite sample space, there are certain unusual subsets for
which one cannot associate meaningful probabilities. This is an intricate technical issue,
involving the mathematics of measure theory. Fortunately, such pathological subsets
do not arise in the problems considered in this text or in practice, and the issue can be
safely ignored.




In game 1, it is only the total number of heads in the ten-toss sequence that
matters, while in game 2, the order of heads and tails is also important. Thus, in
a probabilistic model for game 1, we can work with a sample space consisting of
eleven possible outcomes, namely, 0, 1, ..., 10. In game 2, a finer grain description
of the experiment is called for, and it is more appropriate to let the sample space
consist of every possible ten-long sequence of heads and tails.
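
The difference between the two games can be made concrete by computing payoffs on the fine-grained sample space. The sketch below is one reading of the rules of Example 1.1: in game 2, the toss on which a head appears is assumed to be paid at the rate in force before the doubling. The function names are illustrative.

```python
from itertools import product

def game1_payoff(tosses):
    """Game 1: $1 for each head."""
    return tosses.count("H")

def game2_payoff(tosses):
    """Game 2: the per-toss amount starts at $1 and doubles after each head."""
    total, rate = 0, 1
    for toss in tosses:
        total += rate          # a toss showing a head is still paid at the old rate
        if toss == "H":
            rate *= 2
    return total

# Fine-grained sample space: all ten-long sequences of heads and tails.
sequences = ["".join(s) for s in product("HT", repeat=10)]
print(len(sequences))

# Coarse sample space for game 1: the possible numbers of heads.
print(sorted({game1_payoff(s) for s in sequences}))

# Two sequences with one head each, in different positions.
a, b = "HTTTTTTTTT", "TTTTTTTTTH"
print(game1_payoff(a), game1_payoff(b))   # equal: only the number of heads matters
print(game2_payoff(a), game2_payoff(b))   # different: the order matters
```

With this reading, the two sequences pay the same in game 1 but $19 and $10, respectively, in game 2, which is why the eleven-outcome sample space is too coarse for game 2.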


Sequential Models 


Many experiments have an inherently sequential character; for example, tossing 
a coin three times, observing the value of a stock on five successive days, or
receiving eight successive digits at a communication receiver. It is then often 
useful to describe the experiment and the associated sample space by means of 
a tree-based sequential description, as in Fig. 1.3. 


[Figure 1.3 diagram; left panel: “Sample space for a pair of rolls” (horizontal axis: 1st roll); right panel: “Tree-based sequential description” (labels: Root, Leaves)]





Figure 1.3: Two equivalent descriptions of the sample space of an experiment
involving two rolls of a 4-sided die. The possible outcomes are all the ordered pairs
of the form (i, j), where i is the result of the first roll, and j is the result of the
second. These outcomes can be arranged in a 2-dimensional grid as in the figure
on the left, or they can be described by the tree on the right, which reflects the
sequential character of the experiment. Here, each possible outcome corresponds
to a leaf of the tree and is associated with the unique path from the root to
that leaf. The shaded area on the left is the event {(1,4), (2,4), (3,4), (4,4)}
that the result of the second roll is 4. That same event can be described by the
set of leaves highlighted on the right. Note also that every node of the tree can
be identified with an event, namely, the set of all leaves downstream from that
node. For example, the node labeled by a 1 can be identified with the event
{(1,1), (1,2), (1,3), (1,4)} that the result of the first roll is 1.
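
The two descriptions in Fig. 1.3 can be mirrored in a few lines of code. This is a minimal sketch in which outcomes are represented as tuples (i, j); the helper name `leaves_below` is an illustrative choice.

```python
from itertools import product

faces = [1, 2, 3, 4]
# Each outcome is an ordered pair (i, j): first roll i, second roll j.
sample_space = list(product(faces, repeat=2))

# Grid view: the event that the result of the second roll is 4.
second_is_4 = {(i, j) for (i, j) in sample_space if j == 4}

def leaves_below(first_roll):
    """Tree view: the event identified with the first-stage node labeled first_roll."""
    return {(i, j) for (i, j) in sample_space if i == first_roll}

print(len(sample_space))
print(sorted(second_is_4))
print(sorted(leaves_below(1)))
```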


Probability Laws 


Suppose we have settled on the sample space $\Omega$ associated with an experiment.
To complete the probabilistic model, we must now introduce a probability law.
Intuitively, this specifies the “likelihood” of any outcome, or of any set of possible
outcomes (an event, as we have called it earlier). More precisely, the probability
law assigns to every event A, a number P(A), called the probability of A,
satisfying the following axioms.


Probability Axioms 


1. (Nonnegativity) $P(A) \ge 0$, for every event A.
2. (Additivity) If A and B are two disjoint events, then the probability
of their union satisfies


$$P(A \cup B) = P(A) + P(B).$$


More generally, if the sample space has an infinite number of elements
and $A_1, A_2, \ldots$ is a sequence of disjoint events, then the probability of
their union satisfies


$$P(A_1 \cup A_2 \cup \cdots) = P(A_1) + P(A_2) + \cdots.$$


3. (Normalization) The probability of the entire sample space $\Omega$ is
equal to 1, that is, $P(\Omega) = 1$.





In order to visualize a probability law, consider a unit of mass which is
“spread” over the sample space. Then, P(A) is simply the total mass that was 
assigned collectively to the elements of A. In terms of this analogy, the additivity 
axiom becomes quite intuitive: the total mass in a sequence of disjoint events is 
the sum of their individual masses. 

A more concrete interpretation of probabilities is in terms of relative fre- 
quencies: a statement such as P(A) = 2/3 often represents a belief that event A 
will occur in about two thirds out of a large number of repetitions of the exper- 
iment. Such an interpretation, though not always appropriate, can sometimes 
facilitate our intuitive understanding. It will be revisited in Chapter 5, in our 
study of limit theorems. 

There are many natural properties of a probability law, which have not been
included in the above axioms for the simple reason that they can be derived 
from them. For example, note that the normalization and additivity axioms 
imply that 


$$1 = P(\Omega) = P(\Omega \cup \emptyset) = P(\Omega) + P(\emptyset) = 1 + P(\emptyset),$$

and this shows that the probability of the empty event is 0:

$$P(\emptyset) = 0.$$

As another example, consider three disjoint events $A_1$, $A_2$, and $A_3$. We can use
the additivity axiom for two disjoint events repeatedly, to obtain

$$\begin{aligned}
P(A_1 \cup A_2 \cup A_3) &= P\big(A_1 \cup (A_2 \cup A_3)\big) \\
&= P(A_1) + P(A_2 \cup A_3) \\
&= P(A_1) + P(A_2) + P(A_3).
\end{aligned}$$

Proceeding similarly, we obtain that the probability of the union of finitely many
disjoint events is always equal to the sum of the probabilities of these events.
More such properties will be considered shortly.


Discrete Models 


Here is an illustration of how to construct a probability law starting from some 
common sense assumptions about a model. 


Example 1.2. Consider an experiment involving a single coin toss. There are two 
possible outcomes, heads (H) and tails (T). The sample space is $\Omega = \{H, T\}$, and
the events are


$$\{H, T\}, \quad \{H\}, \quad \{T\}, \quad \emptyset.$$


If the coin is fair, i.e., if we believe that heads and tails are “equally likely,” we
should assign equal probabilities to the two possible outcomes and specify that
$P(\{H\}) = P(\{T\}) = 0.5$. The additivity axiom implies that


$$P(\{H, T\}) = P(\{H\}) + P(\{T\}) = 1,$$


which is consistent with the normalization axiom. Thus, the probability law is given
by


$$P(\{H, T\}) = 1, \quad P(\{H\}) = 0.5, \quad P(\{T\}) = 0.5, \quad P(\emptyset) = 0,$$


and satisfies all three axioms.
Consider another experiment involving three coin tosses. The outcome will
now be a 3-long string of heads or tails. The sample space is


$$\Omega = \{HHH, HHT, HTH, HTT, THH, THT, TTH, TTT\}.$$


We assume that each possible outcome has the same probability of 1/8. Let us
construct a probability law that satisfies the three axioms. Consider, as an example,
the event


$$A = \{\text{exactly 2 heads occur}\} = \{HHT, HTH, THH\}.$$

Using additivity, the probability of A is the sum of the probabilities of its elements:

$$\begin{aligned}
P(\{HHT, HTH, THH\}) &= P(\{HHT\}) + P(\{HTH\}) + P(\{THH\}) \\
&= \frac{1}{8} + \frac{1}{8} + \frac{1}{8} = \frac{3}{8}.
\end{aligned}$$

Similarly, the probability of any event is equal to 1/8 times the number of possible
outcomes contained in the event. This defines a probability law that satisfies the
three axioms.


By using the additivity axiom and by generalizing the reasoning in the 
preceding example, we reach the following conclusion. 


Discrete Probability Law 


If the sample space consists of a finite number of possible outcomes, then the
probability law is specified by the probabilities of the events that consist of
a single element. In particular, the probability of any event $\{s_1, s_2, \ldots, s_n\}$
is the sum of the probabilities of its elements:


$$P(\{s_1, s_2, \ldots, s_n\}) = P(s_1) + P(s_2) + \cdots + P(s_n).$$





Note that we are using here the simpler notation $P(s_i)$ to denote the
probability of the event $\{s_i\}$, instead of the more precise $P(\{s_i\})$. This convention
will be used throughout the remainder of the book.
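
A discrete probability law of this kind is easy to represent as a table of single-outcome probabilities. The sketch below uses a hypothetical loaded four-sided die whose probabilities are made up for illustration; the names `law` and `P` are likewise illustrative.

```python
from fractions import Fraction

# A hypothetical loaded four-sided die; these probabilities are made up for illustration.
law = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}

def P(event):
    """Probability of an event (a set of outcomes): the sum over its elements."""
    return sum((law[s] for s in event), Fraction(0))

omega = set(law)
A, B = {1, 2}, {4}            # two disjoint events

print(P(omega))               # normalization
print(P(set()))               # the empty event
print(P(A | B), P(A) + P(B))  # additivity for disjoint events
```

Exact fractions are used instead of floating-point numbers so that the equalities required by the axioms hold without rounding error.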

In the special case where the probabilities $P(s_1), \ldots, P(s_n)$ are all the same
(by necessity equal to 1/n, in view of the normalization axiom), we obtain the 
following. 


Discrete Uniform Probability Law 


If the sample space consists of n possible outcomes which are equally likely
(i.e., all single-element events have the same probability), then the probability
of any event A is given by


$$P(A) = \frac{\text{number of elements of } A}{n}.$$





Let us provide a few more examples of sample spaces and probability laws. 


Example 1.3. Consider the experiment of rolling a pair of 4-sided dice (cf. Fig.
1.4). We assume the dice are fair, and we interpret this assumption to mean that
each of the sixteen possible outcomes [pairs (i, j), with i, j = 1, 2, 3, 4] has the same
probability of 1/16. To calculate the probability of an event, we must count the
number of elements of the event and divide by 16 (the total number of possible
outcomes). Here are some event probabilities calculated in this way:


$$\begin{aligned}
P(\{\text{the sum of the rolls is even}\}) &= 8/16 = 1/2, \\
P(\{\text{the sum of the rolls is odd}\}) &= 8/16 = 1/2, \\
P(\{\text{the first roll is equal to the second}\}) &= 4/16 = 1/4, \\
P(\{\text{the first roll is larger than the second}\}) &= 6/16 = 3/8, \\
P(\{\text{at least one roll is equal to 4}\}) &= 7/16.
\end{aligned}$$


[Figure 1.4 diagram; panel title: “Sample space for a pair of rolls” (horizontal axis: 1st roll); annotation: Event = {the first roll is equal to the second}, Probability = 4/16]


Figure 1.4: Various events in the experiment of rolling a pair of 4-sided dice, 
and their probabilities, calculated according to the discrete uniform law. 
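
The counting in Example 1.3 can be reproduced by enumerating the sixteen outcomes. The following is a minimal sketch of the discrete uniform law; the helper name `prob` and the use of predicates to describe events are illustrative choices.

```python
from fractions import Fraction
from itertools import product

outcomes = list(product(range(1, 5), repeat=2))   # 16 equally likely pairs (i, j)

def prob(condition):
    """Discrete uniform law: (number of outcomes in the event) / (total number)."""
    count = sum(1 for (i, j) in outcomes if condition(i, j))
    return Fraction(count, len(outcomes))

print(prob(lambda i, j: (i + j) % 2 == 0))   # the sum of the rolls is even
print(prob(lambda i, j: (i + j) % 2 == 1))   # the sum of the rolls is odd
print(prob(lambda i, j: i == j))             # the first roll is equal to the second
print(prob(lambda i, j: i > j))              # the first roll is larger than the second
print(prob(lambda i, j: i == 4 or j == 4))   # at least one roll is equal to 4
```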


Continuous Models 


Probabilistic models with continuous sample spaces differ from their discrete 
counterparts in that the probabilities of the single-element events may not be 
sufficient to characterize the probability law. This is illustrated in the following 
examples, which also indicate how to generalize the uniform probability law to 
the case of a continuous sample space. 


Example 1.4. A wheel of fortune is continuously calibrated from 0 to 1, so the
possible outcomes of an experiment consisting of a single spin are the numbers in
the interval $\Omega = [0, 1]$. Assuming a fair wheel, it is appropriate to consider all
outcomes equally likely, but what is the probability of the event consisting of a
single element? It cannot be positive, because then, using the additivity axiom, it
would follow that events with a sufficiently large number of elements would have
probability larger than 1. Therefore, the probability of any event that consists of a
single element must be 0.

In this example, it makes sense to assign probability b - a to any
subinterval [a, b] of [0, 1], and to calculate the probability of a more complicated set by
evaluating its “length.”† This assignment satisfies the three probability axioms and
qualifies as a legitimate probability law.
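
For events that are finite unions of disjoint subintervals, the “length” rule of Example 1.4 reduces to adding up interval lengths. The sketch below covers only that simple case and assumes the caller supplies intervals that do not overlap; the function name is illustrative.

```python
from fractions import Fraction

def length_probability(intervals):
    """Probability of a union of disjoint subintervals [a, b] of [0, 1]: total length.

    Assumption: the intervals are disjoint; this function does not check for overlap.
    """
    total = Fraction(0)
    for a, b in intervals:
        a, b = Fraction(a), Fraction(b)
        if not (0 <= a <= b <= 1):
            raise ValueError("each interval must satisfy 0 <= a <= b <= 1")
        total += b - a
    return total

half = Fraction(1, 2)
print(length_probability([(0, 1)]))          # the whole sample space
print(length_probability([(half, half)]))    # an event consisting of a single element
print(length_probability([(0, Fraction(1, 4)), (half, Fraction(3, 4))]))
```

The single-point interval has length zero, matching the conclusion that single-element events have zero probability, while the whole interval has probability 1.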


Example 1.5. Romeo and Juliet have a date at a given time, and each will arrive
at the meeting place with a delay between 0 and 1 hour, with all pairs of delays
being equally likely. The first to arrive will wait for 15 minutes and will leave if the
other has not yet arrived. What is the probability that they will meet?

Let us use as sample space the unit square, whose elements are the possible
pairs of delays for the two of them. Our interpretation of “equally likely” pairs of
delays is to let the probability of a subset of $\Omega$ be equal to its area. This probability
law satisfies the three probability axioms. The event that Romeo and Juliet will
meet is the shaded region in Fig. 1.5, and its probability is calculated to be 7/16.


Figure 1.5: The event M that Romeo and Juliet will arrive within 15 minutes
of each other (cf. Example 1.5) is

$$M = \{(x, y) \mid |x - y| \le 1/4,\ 0 \le x \le 1,\ 0 \le y \le 1\},$$

and is shaded in the figure. The area of M is 1 minus the area of the two unshaded
triangles, or $1 - (3/4) \cdot (3/4) = 7/16$. Thus, the probability of meeting is 7/16.
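
The area calculation can be cross-checked by simulation. The sketch below assumes that the two delays are drawn independently and uniformly from [0, 1] (our reading of "all pairs of delays being equally likely"); the function name `estimate_meeting_probability`, the trial count, and the seed are illustrative assumptions.

```python
import random

def estimate_meeting_probability(num_trials=200_000, wait=0.25, seed=0):
    """Monte Carlo estimate of P(|x - y| <= wait) for a uniform point in the unit square."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    meetings = 0
    for _ in range(num_trials):
        x = rng.random()               # Romeo's delay, in hours
        y = rng.random()               # Juliet's delay, in hours
        if abs(x - y) <= wait:
            meetings += 1
    return meetings / num_trials

# Exact value from the figure: unit square minus two triangles with legs 3/4.
exact = 1 - (3 / 4) ** 2
estimate = estimate_meeting_probability()

print(exact, estimate)
```

The estimate should land close to 7/16 = 0.4375, with a sampling error that shrinks as the number of trials grows.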





† The "length" of a subset S of [0, 1] is the integral $\int_S dt$, which is defined, for
"nice" sets S, in the usual calculus sense. For unusual sets, this integral may not be
well defined mathematically, but such issues belong to a more advanced treatment of
the subject. Incidentally, the legitimacy of using length as a probability law hinges on
the fact that the unit interval has an uncountably infinite number of elements. Indeed,
if the unit interval had a countable number of elements, with each element having
zero probability, the additivity axiom would imply that the whole interval has zero
probability, which would contradict the normalization axiom.
Properties of Probability Laws


Probability laws have a number of properties, which can be deduced from the 
axioms. Some of them are summarized below. 


Some Properties of Probability Laws 
Consider a probability law, and let A, B, and C be events. 
(a) If $A \subset B$, then $P(A) \le P(B)$.
(b) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
(c) $P(A \cup B) \le P(A) + P(B)$.
(d) $P(A \cup B \cup C) = P(A) + P(A^c \cap B) + P(A^c \cap B^c \cap C)$.





These properties, and other similar ones, can be visualized and verified 
graphically using Venn diagrams, as in Fig. 1.6. Note that property (c) can be 
generalized as follows: 


$$P(A_1 \cup A_2 \cup \cdots \cup A_n) \le \sum_{i=1}^{n} P(A_i).$$

To see this, we apply property (c) to the sets $A_1$ and $A_2 \cup \cdots \cup A_n$, to obtain

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) \le P(A_1) + P(A_2 \cup \cdots \cup A_n).$$

We also apply property (c) to the sets $A_2$ and $A_3 \cup \cdots \cup A_n$, to obtain

$$P(A_2 \cup \cdots \cup A_n) \le P(A_2) + P(A_3 \cup \cdots \cup A_n).$$

We continue similarly, and finally add.
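
Properties (a)-(d) and the generalization of property (c) can be spot-checked on a small finite model. The sketch below reuses the two-roll experiment with a 4-sided die under the discrete uniform law; the particular events `A`, `B`, `C` and the helper `P` are illustrative choices of ours, and a check on one model is of course not a proof.

```python
from fractions import Fraction
from itertools import product

# Sample space: 16 equally likely pairs of rolls of a 4-sided die.
OMEGA = frozenset(product(range(1, 5), repeat=2))

def P(event):
    """Discrete uniform probability of a set of outcomes."""
    return Fraction(len(event & OMEGA), len(OMEGA))

A = frozenset(o for o in OMEGA if o[0] == o[1])             # rolls are equal
B = frozenset(o for o in OMEGA if (o[0] + o[1]) % 2 == 0)   # sum is even (contains A)
C = frozenset(o for o in OMEGA if 4 in o)                   # at least one roll is 4

A_c = OMEGA - A   # complement of A
B_c = OMEGA - B   # complement of B

# For frozensets, "|" is union, "&" is intersection, "<=" is the subset test.
prop_a = (A <= B) and P(A) <= P(B)
prop_b = P(A | C) == P(A) + P(C) - P(A & C)
prop_c = P(A | C) <= P(A) + P(C)
prop_d = P(A | B | C) == P(A) + P(A_c & B) + P(A_c & B_c & C)
union_bound = P(A | B | C) <= P(A) + P(B) + P(C)

print(prop_a, prop_b, prop_c, prop_d, union_bound)
```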


Models and Reality 


The framework of probability theory can be used to analyze uncertainty in a 
wide variety of physical contexts. Typically, this involves two distinct stages. 


(a) In the first stage, we construct a probabilistic model by specifying a prob- 
ability law on a suitably defined sample space. There are no hard rules to 
guide this step, other than the requirement that the probability law con- 
form to the three axioms. Reasonable people may disagree on which model 
best represents reality. In many cases, one may even want to use a some- 
what "incorrect" model, if it is simpler than the “correct” one or allows for 
tractable calculations. This is consistent with common practice in science 
and engineering, where the choice of a model often involves a tradeoff
between accuracy, simplicity, and tractability. Sometimes, a model is chosen
on the basis of historical data or past outcomes of similar experiments,
using statistical inference methods, which will be discussed in Chapters 8
and 9.


(b) In the second stage, we work within a fully specified probabilistic model and
derive the probabilities of certain events, or deduce some interesting
properties. While the first stage entails the often open-ended task of connecting
the real world with mathematics, the second one is tightly regulated by the
rules of ordinary logic and the axioms of probability. Difficulties may arise
in the latter if some required calculations are complex, or if a probability
law is specified in an indirect fashion. Even so, there is no room for
ambiguity: all conceivable questions have precise answers and it is only a matter
of developing the skill to arrive at them.


Figure 1.6: Visualization and verification of various properties of probability
laws using Venn diagrams. If $A \subset B$, then $B$ is the union of the two disjoint
events $A$ and $A^c \cap B$; see diagram (a). Therefore, by the additivity axiom, we
have

$$P(B) = P(A) + P(A^c \cap B) \ge P(A),$$

where the inequality follows from the nonnegativity axiom, and verifies
property (a).

From diagram (b), we can express the events $A \cup B$ and $B$ as unions of
disjoint events:

$$A \cup B = A \cup (A^c \cap B), \qquad B = (A \cap B) \cup (A^c \cap B).$$

Using the additivity axiom, we have

$$P(A \cup B) = P(A) + P(A^c \cap B), \qquad P(B) = P(A \cap B) + P(A^c \cap B).$$

Subtracting the second equality from the first and rearranging terms, we obtain
$P(A \cup B) = P(A) + P(B) - P(A \cap B)$, verifying property (b). Using also the fact
$P(A \cap B) \ge 0$ (the nonnegativity axiom), we obtain $P(A \cup B) \le P(A) + P(B)$,
verifying property (c).

From diagram (c), we see that the event $A \cup B \cup C$ can be expressed as a
union of three disjoint events:

$$A \cup B \cup C = A \cup (A^c \cap B) \cup (A^c \cap B^c \cap C),$$

so property (d) follows as a consequence of the additivity axiom.


Probability theory is full of “paradoxes” in which different calculation 
methods seem to give different answers to the same question. Invariably though, 
these apparent inconsistencies turn out to reflect poorly specified or ambiguous 
probabilistic models. An example, Bertrand's paradox, is shown in Fig. 1.7. 










Figure 1.7: This example, presented by L. F. Bertrand in 1889, illustrates the
need to specify unambiguously a probabilistic model. Consider a circle and an
equilateral triangle inscribed in the circle. What is the probability that the length
of a randomly chosen chord of the circle is greater than the side of the triangle?
The answer here depends on the precise meaning of "randomly chosen." The two
methods illustrated in parts (a) and (b) of the figure lead to contradictory results.

In (a), we take a radius of the circle, such as AB, and we choose a point
C on that radius, with all points being equally likely. We then draw the chord
through C that is orthogonal to AB. From elementary geometry, AB intersects
the triangle at the midpoint of AB, so the probability that the length of the chord
is greater than the side is 1/2.

In (b), we take a point on the circle, such as the vertex V, we draw the
tangent to the circle through V, and we draw a line through V that forms a random
angle $\Phi$ with the tangent, with all angles being equally likely. We consider the
chord obtained by the intersection of this line with the circle. From elementary
geometry, the length of the chord is greater than the side of the triangle if $\Phi$ is
between $\pi/3$ and $2\pi/3$. Since $\Phi$ takes values between 0 and $\pi$, the probability
that the length of the chord is greater than the side is 1/3.
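
The two chord-selection methods can be simulated side by side. The sketch below assumes a circle of radius 1, so that the inscribed equilateral triangle has side $\sqrt{3}$, a chord at distance $d$ from the center has length $2\sqrt{1 - d^2}$, and a chord forming angle $\Phi$ with the tangent has length $2\sin\Phi$. The function names, trial count, and seed are illustrative assumptions.

```python
import math
import random

SIDE = math.sqrt(3.0)  # side of an equilateral triangle inscribed in a unit circle

def chord_via_radius_point(rng):
    """Method (a): uniform point on a radius; chord orthogonal to that radius."""
    d = rng.random()                       # distance of the point from the center
    return 2.0 * math.sqrt(1.0 - d * d)

def chord_via_tangent_angle(rng):
    """Method (b): uniform angle in (0, pi) between the chord and the tangent."""
    phi = rng.uniform(0.0, math.pi)
    return 2.0 * math.sin(phi)

def estimate(method, num_trials=200_000, seed=0):
    """Fraction of simulated chords that are longer than the triangle side."""
    rng = random.Random(seed)
    longer = sum(1 for _ in range(num_trials) if method(rng) > SIDE)
    return longer / num_trials

p_a = estimate(chord_via_radius_point)
p_b = estimate(chord_via_tangent_angle)

print(p_a, p_b)
```

The two estimates should settle near 1/2 and 1/3, respectively, which makes the point of the figure concrete: the two procedures define different probability laws on the set of chords.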
A Brief History of Probability


B.C.E. Games of chance were popular in ancient Greece and Rome, but
no scientific development of the subject took place, possibly because the 
number system used by the Greeks did not facilitate algebraic calculations. 
The development of probability based on sound scientific analysis had to 
await the development of the modern arithmetic system by the Hindus and 
the Arabs in the second half of the first millennium, as well as the flood of 
scientific ideas generated by the Renaissance. 


16th century. Girolamo Cardano, a colorful and controversial Italian 
mathematician, publishes the first book describing correct methods for cal- 
culating probabilities in games of chance involving dice and cards. 


17th century. A correspondence between Fermat and Pascal touches upon 
several interesting probability questions and motivates further study in the 
field. 


18th century. Jacob Bernoulli studies repeated coin tossing and introduces 
the first law of large numbers, which lays a foundation for linking theoreti- 
cal probability concepts and empirical fact. Several mathematicians, such as 
Daniel Bernoulli, Leibnitz, Bayes, and Lagrange, make important contribu- 
tions to probability theory and its use in analyzing real-world phenomena. 
De Moivre introduces the normal distribution and proves the first form of 
the central limit theorem. 


19th century. Laplace publishes an influential book that establishes the 
importance of probability as a quantitative field and contains many original 
contributions, including a more general version of the central limit theo- 
rem. Legendre and Gauss apply probability to astronomical predictions, 
using the method of least squares, thus pointing the way to a vast range of 
applications. Poisson publishes an influential book with many original con- 
tributions, including the Poisson distribution. Chebyshev, and his students 
Markov and Lyapunov, study limit theorems and raise the standards of 
mathematical rigor in the field. Throughout this period, probability theory 
is largely viewed as a natural science, its primary goal being the explanation 
of physical phenomena. Consistently with this goal, probabilities are mainly 
interpreted as limits of relative frequencies in the context of repeatable ex- 
periments. 


20th century. Relative frequency is abandoned as the conceptual foun- 
dation of probability theory in favor of a now universally used axiomatic 
system, introduced by Kolmogorov. Similar to other branches of mathe- 
matics, the development of probability theory from the axioms relies only 
on logical correctness, regardless of its relevance to physical phenomena. 
Nonetheless, probability theory is used pervasively in science and engineer- 
ing because of its ability to describe and interpret most types of uncertain 
phenomena in the real world. 





1.3 CONDITIONAL PROBABILITY


Conditional probability provides us with a way to reason about the outcome 
of an experiment, based on partial information. Here are some examples of 
situations we have in mind: 


(a) In an experiment involving two successive rolls of a die, you are told that 
the sum of the two rolls is 9. How likely is it that the first roll was a 6? 


(b) In a word guessing game, the first letter of the word is a "t". What is the 
likelihood that the second letter is an “h”? 


(c) How likely is it that a person has a certain disease given that a medical 
test was negative? 


(d) A spot shows up on a radar screen. How likely is it to correspond to an 
aircraft? 


In more precise terms, given an experiment, a corresponding sample space, 
and a probability law, suppose that we know that the outcome is within some 
given event B. We wish to quantify the likelihood that the outcome also belongs 
to some other given event A. We thus seek to construct a new probability law 
that takes into account the available knowledge: a probability law that, for any
event A, specifies the conditional probability of A given B, denoted by
P(A | B).

We would like the conditional probabilities P(A | B) of different events A 
to constitute a legitimate probability law, which satisfies the probability axioms. 
The conditional probabilities should also be consistent with our intuition in im- 
portant special cases, e.g., when all possible outcomes of the experiment are 
equally likely. For example, suppose that all six possible outcomes of a fair die
roll are equally likely. If we are told that the outcome is even, we are left with
only three possible outcomes, namely, 2, 4, and 6. These three outcomes were
equally likely to start with, and so they should remain equally likely given the
additional knowledge that the outcome was even. Thus, it is reasonable to let

$$P(\text{the outcome is 6} \mid \text{the outcome is even}) = \frac{1}{3}.$$


This argument suggests that an appropriate definition of conditional probability 
when all outcomes are equally likely, is given by 


$$P(A \mid B) = \frac{\text{number of elements of } A \cap B}{\text{number of elements of } B}.$$


Generalizing the argument, we introduce the following definition of condi- 
tional probability: 
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$

where we assume that $P(B) > 0$; the conditional probability is undefined if the
conditioning event has zero probability. In words, out of the total probability of 
the elements of B, P(A | B) is the fraction that is assigned to possible outcomes 
that also belong to A. 
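
For a finite sample space, the definition translates almost directly into code. The following sketch assumes that a probability law is represented as a dictionary mapping each outcome to its probability; the function `conditional_probability` and the variable `die` are hypothetical names introduced only for illustration.

```python
from fractions import Fraction

def conditional_probability(law, A, B):
    """Return P(A | B) = P(A ∩ B) / P(B) for events A, B given as sets of outcomes.

    `law` maps each outcome of a finite sample space to its probability.
    """
    p_B = sum(law[w] for w in B)
    if p_B == 0:
        # Mirrors the text: the conditional probability is undefined if P(B) = 0.
        raise ValueError("P(A | B) is undefined when P(B) = 0")
    p_A_and_B = sum(law[w] for w in A & B)
    return p_A_and_B / p_B

# Fair six-sided die: all six outcomes equally likely.
die = {face: Fraction(1, 6) for face in range(1, 7)}

p_six_given_even = conditional_probability(die, {6}, {2, 4, 6})
print(p_six_given_even)
```

Because the function works with the probabilities themselves rather than with counts, the same sketch applies to laws that are not uniform.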


Conditional Probabilities Specify a Probability Law 


For a fixed event B, it can be verified that the conditional probabilities P(A | B) 
form a legitimate probability law that satisfies the three axioms. Indeed, non- 
negativity is clear. Furthermore, 


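$$P(\Omega \mid B) = \frac{P(\Omega \cap B)}{P(B)} = \frac{P(B)}{P(B)} = 1,$$

and the normalization axiom is also satisfied. To verify the additivity axiom, we
write for any two disjoint events $A_1$ and $A_2$,

$$
\begin{aligned}
P(A_1 \cup A_2 \mid B) &= \frac{P\big((A_1 \cup A_2) \cap B\big)}{P(B)}\\
&= \frac{P\big((A_1 \cap B) \cup (A_2 \cap B)\big)}{P(B)}\\
&= \frac{P(A_1 \cap B) + P(A_2 \cap B)}{P(B)}\\
&= \frac{P(A_1 \cap B)}{P(B)} + \frac{P(A_2 \cap B)}{P(B)}\\
&= P(A_1 \mid B) + P(A_2 \mid B),
\end{aligned}
$$

where for the third equality, we used the fact that $A_1 \cap B$ and $A_2 \cap B$ are
disjoint sets, and the additivity axiom for the (unconditional) probability law.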
The argument for a countable collection of disjoint sets is similar. 

Since conditional probabilities constitute a legitimate probability law, all 
general properties of probability laws remain valid. For example, a fact such as 
$P(A \cup C) \le P(A) + P(C)$ translates to the new fact

$$P(A \cup C \mid B) \le P(A \mid B) + P(C \mid B).$$


Let us also note that since we have P(B | B) = P(B)/P(B) = 1, all of the con- 
ditional probability is concentrated on B. Thus, we might as well discard all 
possible outcomes outside B and treat the conditional probabilities as a proba- 
bility law defined on the new universe B. 

Let us summarize the conclusions reached so far. 
Properties of Conditional Probability


• The conditional probability of an event A, given an event B with
$P(B) > 0$, is defined by

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)},$$

and specifies a new (conditional) probability law on the same sample
space $\Omega$. In particular, all properties of probability laws remain valid
for conditional probability laws.


• Conditional probabilities can also be viewed as a probability law on a
new universe B, because all of the conditional probability is
concentrated on B.


• If the possible outcomes are finitely many and equally likely, then

$$P(A \mid B) = \frac{\text{number of elements of } A \cap B}{\text{number of elements of } B}.$$





Example 1.6. We toss a fair coin three successive times. We wish to find the 
conditional probability P(A| B) when A and B are the events 


$$A = \{\text{more heads than tails come up}\}, \qquad B = \{\text{1st toss is a head}\}.$$

The sample space consists of eight sequences,

$$\Omega = \{HHH,\ HHT,\ HTH,\ HTT,\ THH,\ THT,\ TTH,\ TTT\},$$

which we assume to be equally likely. The event B consists of the four elements
HHH, HHT, HTH, HTT, so its probability is

$$P(B) = \frac{4}{8}.$$

The event $A \cap B$ consists of the three elements HHH, HHT, HTH, so its
probability is

$$P(A \cap B) = \frac{3}{8}.$$

Thus, the conditional probability $P(A \mid B)$ is

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{3/8}{4/8} = \frac{3}{4}.$$

Because all possible outcomes are equally likely here, we can also compute $P(A \mid B)$
using a shortcut. We can bypass the calculation of $P(B)$ and $P(A \cap B)$, and simply
divide the number of elements shared by A and B (which is 3) by the number of
elements of B (which is 4), to obtain the same result 3/4.
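
Both routes to $P(A \mid B)$ in this example, the definition and the counting shortcut, can be reproduced by enumerating the eight sequences. In the sketch below, the names `OMEGA`, `P`, `via_definition`, and `via_counting` are illustrative choices of ours.

```python
from fractions import Fraction
from itertools import product

# All eight equally likely sequences of three tosses, e.g. "HHT".
OMEGA = ["".join(tosses) for tosses in product("HT", repeat=3)]

A = {w for w in OMEGA if w.count("H") > w.count("T")}   # more heads than tails
B = {w for w in OMEGA if w[0] == "H"}                    # 1st toss is a head

def P(event):
    """Discrete uniform law on the eight sequences."""
    return Fraction(len(event), len(OMEGA))

via_definition = P(A & B) / P(B)                  # P(A ∩ B) / P(B)
via_counting = Fraction(len(A & B), len(B))       # |A ∩ B| / |B|

print(via_definition, via_counting)
```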


Example 1.7. A fair 4-sided die is rolled twice and we assume that all sixteen 
possible outcomes are equally likely. Let X and Y be the result of the 1st and the 
2nd roll, respectively. We wish to determine the conditional probability P(A | B), 
where 


$$A = \{\max(X, Y) = m\}, \qquad B = \{\min(X, Y) = 2\},$$

and m takes each of the values 1, 2, 3, 4.

As in the preceding example, we can first determine the probabilities $P(A \cap B)$
and $P(B)$ by counting the number of elements of $A \cap B$ and $B$, respectively, and
dividing by 16. Alternatively, we can directly divide the number of elements of
$A \cap B$ by the number of elements of $B$; see Fig. 1.8.


Figure 1.8: Sample space of an experiment involving two rolls of a 4-sided die
(cf. Example 1.7). All outcomes are equally likely, each with probability 1/16. The
conditioning event $B = \{\min(X, Y) = 2\}$ consists of the 5-element shaded set.
The set $A = \{\max(X, Y) = m\}$ shares with B two elements
if m = 3 or m = 4, one element if m = 2, and no element if m = 1. Thus, we have

$$
P\big(\{\max(X, Y) = m\} \mid B\big) =
\begin{cases}
2/5, & \text{if } m = 3 \text{ or } m = 4,\\
1/5, & \text{if } m = 2,\\
0, & \text{if } m = 1.
\end{cases}
$$
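
The counting argument of Example 1.7 can be carried out mechanically. The following sketch enumerates the 16 equally likely outcomes and, for each value of m, divides the number of elements of $A \cap B$ by the number of elements of $B$; the names `B`, `p_max_given_B`, and `table` are illustrative.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (X, Y) of two rolls of a 4-sided die.
OMEGA = list(product(range(1, 5), repeat=2))

# Conditioning event B = {min(X, Y) = 2}.
B = [(x, y) for (x, y) in OMEGA if min(x, y) == 2]

def p_max_given_B(m):
    """P(max(X, Y) = m | B), computed as |A ∩ B| / |B|."""
    shared = [outcome for outcome in B if max(outcome) == m]
    return Fraction(len(shared), len(B))

table = {m: p_max_given_B(m) for m in range(1, 5)}
print(table)
```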


Example 1.8. A conservative design team, call it C, and an innovative design
team, call it N, are asked to separately design a new product within a month. From 
past experience we know that: 


(a) The probability that team C is successful is 2/3. 
(b) The probability that team N is successful is 1/2. 


(c) The probability that at least one team is successful is 3/4. 
Assuming that exactly one successful design is produced, what is the probability
that it was designed by team N? 

There are four possible outcomes here, corresponding to the four combinations 
of success and failure of the two teams: 


SS: both succeed, FF: both fail, 
SF: C succeeds, N fails, FS: C fails, N succeeds. 


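We were given that the probabilities of these outcomes satisfy

$$P(SS) + P(SF) = \frac{2}{3}, \qquad P(SS) + P(FS) = \frac{1}{2}, \qquad P(SS) + P(SF) + P(FS) = \frac{3}{4}.$$

From these relations, together with the normalization equation

$$P(SS) + P(SF) + P(FS) + P(FF) = 1,$$

we can obtain the probabilities of individual outcomes:

$$P(SS) = \frac{5}{12}, \qquad P(SF) = \frac{1}{4}, \qquad P(FS) = \frac{1}{12}, \qquad P(FF) = \frac{1}{4}.$$

The desired conditional probability is

$$P\big(FS \mid \{SF, FS\}\big) = \frac{\frac{1}{12}}{\frac{1}{4} + \frac{1}{12}} = \frac{1}{4}.$$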
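
The four outcome probabilities follow from the three given facts by elementary algebra, which the sketch below spells out with exact fractions. The variable names are our own; the only inputs are the probabilities stated in (a), (b), and (c).

```python
from fractions import Fraction

# Given data from the example.
p_C = Fraction(2, 3)              # P(team C succeeds) = P(SS) + P(SF)
p_N = Fraction(1, 2)              # P(team N succeeds) = P(SS) + P(FS)
p_at_least_one = Fraction(3, 4)   # P(SS) + P(SF) + P(FS)

# Solve the linear relations for the individual outcomes.
p_SS = p_C + p_N - p_at_least_one   # both succeed
p_SF = p_C - p_SS                   # C succeeds, N fails
p_FS = p_N - p_SS                   # C fails, N succeeds
p_FF = 1 - p_at_least_one           # both fail (normalization)

# P(N designed it | exactly one success) = P(FS) / (P(SF) + P(FS)).
p_N_given_exactly_one = p_FS / (p_SF + p_FS)

print(p_SS, p_SF, p_FS, p_FF, p_N_given_exactly_one)
```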





Using Conditional Probability for Modeling 


When constructing probabilistic models for experiments that have a sequential 
character, it is often natural and convenient to first specify conditional prob- 
abilities and then use them to determine unconditional probabilities. The rule 
$P(A \cap B) = P(B)P(A \mid B)$, which is a restatement of the definition of conditional
probability, is often helpful in this process. 


Example 1.9. Radar Detection. If an aircraft is present in a certain area, a 
radar detects it and generates an alarm signal with probability 0.99. If an aircraft is 
not present, the radar generates a (false) alarm with probability 0.10. We assume
that an aircraft is present with probability 0.05. What is the probability of no 
aircraft presence and a false alarm? What is the probability of aircraft presence 
and no detection? 

A sequential representation of the experiment is appropriate here, as shown 
in Fig. 1.9. Let A and B be the events 


A = {an aircraft is present}, 


B = {the radar generates an alarm}, 
and consider also their complements

$$A^c = \{\text{an aircraft is not present}\}, \qquad B^c = \{\text{the radar does not generate an alarm}\}.$$


The given probabilities are recorded along the corresponding branches of the tree de- 
scribing the sample space, as shown in Fig. 1.9. Each possible outcome corresponds 
to a leaf of the tree, and its probability is equal to the product of the probabilities 
associated with the branches in a path from the root to the corresponding leaf. The 
desired probabilities are 


$$
\begin{aligned}
P(\text{not present, false alarm}) &= P(A^c \cap B) = P(A^c)P(B \mid A^c) = 0.95 \cdot 0.10 = 0.095,\\
P(\text{present, no detection}) &= P(A \cap B^c) = P(A)P(B^c \mid A) = 0.05 \cdot 0.01 = 0.0005.
\end{aligned}
$$







Figure 1.9: Sequential description of the experiment for the radar detection 
problem in Example 1.9. 
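
The tree of Example 1.9 has four leaves, and the probability of each is the product of the probabilities along its path. The sketch below tabulates all four using exact fractions; the dictionary `leaves` and its string keys are illustrative names, not notation from the text.

```python
from fractions import Fraction

# Branch probabilities given in the example.
p_present = Fraction(5, 100)               # P(A)
p_alarm_given_present = Fraction(99, 100)  # P(B | A)
p_alarm_given_absent = Fraction(10, 100)   # P(B | A^c)

# Each leaf probability is the product along the path from the root.
leaves = {
    ("present", "alarm"): p_present * p_alarm_given_present,
    ("present", "no alarm"): p_present * (1 - p_alarm_given_present),
    ("absent", "alarm"): (1 - p_present) * p_alarm_given_absent,
    ("absent", "no alarm"): (1 - p_present) * (1 - p_alarm_given_absent),
}

for leaf, p in leaves.items():
    print(leaf, float(p))
```

The four leaf probabilities sum to 1, as they must, since the leaves form a partition of the sample space.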


Extending the preceding example, we have a general rule for calculating 
various probabilities in conjunction with a tree-based sequential description of 
an experiment. In particular: 


(a) We set up the tree so that an event of interest is associated with a leaf. 
We view the occurrence of the event as a sequence of steps, namely, the 
traversals of the branches along the path from the root to the leaf. 


(b) We record the conditional probabilities associated with the branches of the 
tree. 


(c) We obtain the probability of a leaf by multiplying the probabilities recorded 
along the corresponding path of the tree. 
In mathematical terms, we are dealing with an event A which occurs if and
only if each one of several events $A_1, \ldots, A_n$ has occurred, i.e.,
$A = A_1 \cap A_2 \cap \cdots \cap A_n$. The occurrence of A is viewed as an occurrence of $A_1$, followed by the
occurrence of $A_2$, then of $A_3$, etc., and it is visualized as a path with n branches,
corresponding to the events $A_1, \ldots, A_n$. The probability of A is given by the
following rule (see also Fig. 1.10).


Multiplication Rule 


Assuming that all of the conditioning events have positive probability, we 
have 


$$P\big(\cap_{i=1}^{n} A_i\big) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P\big(A_n \mid \cap_{i=1}^{n-1} A_i\big).$$





The multiplication rule can be verified by writing

$$P\big(\cap_{i=1}^{n} A_i\big) = P(A_1) \cdot \frac{P(A_1 \cap A_2)}{P(A_1)} \cdot \frac{P(A_1 \cap A_2 \cap A_3)}{P(A_1 \cap A_2)} \cdots \frac{P\big(\cap_{i=1}^{n} A_i\big)}{P\big(\cap_{i=1}^{n-1} A_i\big)},$$

and by using the definition of conditional probability to rewrite the right-hand
side above as

$$P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2) \cdots P\big(A_n \mid \cap_{i=1}^{n-1} A_i\big).$$

For the case of just two events, $A_1$ and $A_2$, the multiplication rule is simply the
definition of conditional probability.


Figure 1.10: Visualization of the multiplication rule. The intersection event
$A = A_1 \cap A_2 \cap \cdots \cap A_n$ is associated with a particular path on a tree that
describes the experiment. We associate the branches of this path with the events
$A_1, \ldots, A_n$, and we record next to the branches the corresponding conditional
probabilities.

The final node of the path corresponds to the intersection event A, and
its probability is obtained by multiplying the conditional probabilities recorded
along the branches of the path:

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1)P(A_2 \mid A_1) \cdots P(A_n \mid A_1 \cap A_2 \cap \cdots \cap A_{n-1}).$$

Note that any intermediate node along the path also corresponds to some
intersection event and its probability is obtained by multiplying the corresponding
conditional probabilities up to that node. For example, the event $A_1 \cap A_2 \cap A_3$
corresponds to the node shown in the figure, and its probability is

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2).$$


Example 1.10. Three cards are drawn from an ordinary 52-card deck without 
replacement (drawn cards are not placed back in the deck). We wish to find the 
probability that none of the three cards is a heart. We assume that at each step, 
each one of the remaining cards is equally likely to be picked. By symmetry, this 
implies that every triplet of cards is equally likely to be drawn. A cumbersome 
approach, which we will not use, is to count the number of all card triplets that 
do not include a heart, and divide it by the number of all possible card triplets.
Instead, we use a sequential description of the experiment in conjunction with the 
multiplication rule (cf. Fig. 1.11). 
Define the events 


$$A_i = \{\text{the } i\text{th card is not a heart}\}, \qquad i = 1, 2, 3.$$

We will calculate $P(A_1 \cap A_2 \cap A_3)$, the probability that none of the three cards is
a heart, using the multiplication rule

$$P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2).$$

We have

$$P(A_1) = \frac{39}{52},$$

since there are 39 cards that are not hearts in the 52-card deck. Given that the
first card is not a heart, we are left with 51 cards, 38 of which are not hearts, and

$$P(A_2 \mid A_1) = \frac{38}{51}.$$

Finally, given that the first two cards drawn are not hearts, there are 37 cards which
are not hearts in the remaining 50-card deck, and

$$P(A_3 \mid A_1 \cap A_2) = \frac{37}{50}.$$

These probabilities are recorded along the corresponding branches of the tree
describing the sample space, as shown in Fig. 1.11. The desired probability is now
obtained by multiplying the probabilities recorded along the corresponding path of
the tree:

$$P(A_1 \cap A_2 \cap A_3) = \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{37}{50}.$$
Figure 1.11: Sequential description of the experiment in the 3-card selection
problem of Example 1.10.


Note that once the probabilities are recorded along the tree, the probability 
of several other events can be similarly calculated. For example, 


$$
\begin{aligned}
P(\text{1st is not a heart and 2nd is a heart}) &= \frac{39}{52} \cdot \frac{13}{51},\\
P(\text{1st and 2nd are not hearts, and 3rd is a heart}) &= \frac{39}{52} \cdot \frac{38}{51} \cdot \frac{13}{50}.
\end{aligned}
$$
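
The sequential calculation of Example 1.10 can be compared with the "cumbersome" counting approach mentioned there. The sketch below multiplies the branch probabilities along the path and then counts triplets with `math.comb`; the function name `prob_no_hearts` and its parameters are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def prob_no_hearts(num_draws, hearts=13, deck=52):
    """Multiplication rule: product of P(k-th card is not a heart | earlier ones were not)."""
    non_hearts = deck - hearts
    p = Fraction(1)
    for k in range(num_draws):
        # After k non-heart draws, non_hearts - k non-hearts remain among deck - k cards.
        p *= Fraction(non_hearts - k, deck - k)
    return p

sequential = prob_no_hearts(3)

# Counting approach: heart-free triplets divided by all triplets.
counting = Fraction(comb(39, 3), comb(52, 3))

print(sequential, counting)
```

Both approaches give the same exact fraction, which is the sense in which the tree-based description and the equally-likely-triplets model agree.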


Example 1.11. A class consisting of 4 graduate and 12 undergraduate students
is randomly divided into 4 groups of 4. What is the probability that each group
includes a graduate student? We interpret "randomly" to mean that given the
assignment of some students to certain slots, any of the remaining students is equally
likely to be assigned to any of the remaining slots. We then calculate the desired
probability using the multiplication rule, based on the sequential description shown
in Fig. 1.12. Let us denote the four graduate students by 1, 2, 3, 4, and consider
the events

$$
\begin{aligned}
A_1 &= \{\text{students 1 and 2 are in different groups}\},\\
A_2 &= \{\text{students 1, 2, and 3 are in different groups}\},\\
A_3 &= \{\text{students 1, 2, 3, and 4 are in different groups}\}.
\end{aligned}
$$

We will calculate $P(A_3)$ using the multiplication rule:

$$P(A_3) = P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2 \mid A_1)P(A_3 \mid A_1 \cap A_2).$$

We have

$$P(A_1) = \frac{12}{15},$$

since there are 12 student slots in groups other than the one of student 1, and there
are 15 student slots overall, excluding student 1. Similarly,

$$P(A_2 \mid A_1) = \frac{8}{14},$$

since there are 8 student slots in groups other than those of students 1 and 2, and
there are 14 student slots, excluding students 1 and 2. Also,

$$P(A_3 \mid A_1 \cap A_2) = \frac{4}{13},$$

since there are 4 student slots in groups other than those of students 1, 2, and 3,
and there are 13 student slots, excluding students 1, 2, and 3. Thus, the desired
probability is

$$\frac{12}{15} \cdot \frac{8}{14} \cdot \frac{4}{13},$$

and is obtained by multiplying the conditional probabilities along the corresponding
path of the tree in Fig. 1.12.


Figure 1.12: Sequential description of the experiment in the student problem of
Example 1.11.
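
The product of branch probabilities in Example 1.11 can be cross-checked by brute-force counting. The sketch below assumes, consistently with the stated interpretation of "randomly," that every set of 4 slots (out of 16) is equally likely to be the set occupied by the graduate students, and that slot s belongs to group s // 4; this slot numbering is our own illustrative convention.

```python
from fractions import Fraction
from itertools import combinations

# Multiplication rule along the path of the tree.
sequential = Fraction(12, 15) * Fraction(8, 14) * Fraction(4, 13)

# Enumeration: slots 0..15, where slot s belongs to group s // 4.
placements = list(combinations(range(16), 4))   # possible slot sets for the 4 graduates
favorable = sum(
    1 for slots in placements
    if len({s // 4 for s in slots}) == 4         # all four groups are represented
)
by_counting = Fraction(favorable, len(placements))

print(sequential, by_counting)
```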


Example 1.12. The Monty Hall Problem. This is a much discussed puzzle, 
based on an old American game show. You are told that a prize is equally likely to 
be found behind any one of three closed doors in front of you. You point to one of 
the doors. A friend opens for you one of the remaining two doors, after making sure 
that the prize is not behind it. At this point, you can stick to your initial choice, 
or switch to the other unopened door. You win the prize if it lies behind your final 
choice of a door. Consider the following strategies: 


(a) Stick to your initial choice. 
(b) Switch to the other unopened door. 


(c) You first point to door 1. If door 2 is opened, you do not switch. If door 3 is 
opened, you switch. 


Which is the best strategy? To answer the question, let us calculate the probability 
of winning under each of the three strategies. 

Under the strategy of no switching, your initial choice will determine whether 
you win or not, and the probability of winning is 1/3. This is because the prize is 
equally likely to be behind each door. 

Under the strategy of switching, if the prize is behind the initially chosen
door (probability 1/3), you do not win. If it is not (probability 2/3), and given that
another door without a prize has been opened for you, you will get to the winning
door once you switch. Thus, the probability of winning is now 2/3, so (b) is a better
strategy than (a).

Consider now strategy (c). Under this strategy, there is insufficient informa- 
tion for determining the probability of winning. The answer depends on the way 
that your friend chooses which door to open. Let us consider two possibilities. 

Suppose that if the prize is behind door 1, your friend always chooses to open
door 2. (If the prize is behind door 2 or 3, your friend has no choice.) If the prize
is behind door 1, your friend opens door 2, you do not switch, and you win. If the
prize is behind door 2, your friend opens door 3, you switch, and you win. If the
prize is behind door 3, your friend opens door 2, you do not switch, and you lose.
Thus, the probability of winning is 2/3, so strategy (c) in this case is as good as
strategy (b).

Suppose now that if the prize is behind door 1, your friend is equally likely to
open either door 2 or 3. If the prize is behind door 1 (probability 1/3), and if your
friend opens door 2 (probability 1/2), you do not switch and you win (probability
1/6). But if your friend opens door 3, you switch and you lose. If the prize is behind
door 2, your friend opens door 3, you switch, and you win (probability 1/3). If the
prize is behind door 3, your friend opens door 2, you do not switch and you lose.
Thus, the probability of winning is 1/6 + 1/3 = 1/2, so strategy (c) in this case is
inferior to strategy (b).
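
The case analysis for the three strategies can be organized as a small exact computation. The sketch below assumes that you always point to door 1 first and models the friend's behavior by a single parameter `q_open_2`, the probability of opening door 2 when the prize is behind door 1; this parameter and all function names are illustrative assumptions of ours.

```python
from fractions import Fraction

def win_probability(switches_when_opened, q_open_2):
    """Exact probability of winning when you initially point to door 1.

    switches_when_opened(opened) returns True if you switch after `opened` is shown.
    q_open_2 is P(friend opens door 2 | prize is behind door 1).
    """
    third = Fraction(1, 3)
    # (prize door, opened door, probability of this scenario)
    scenarios = [
        (1, 2, third * q_open_2),
        (1, 3, third * (1 - q_open_2)),
        (2, 3, third),   # friend has no choice
        (3, 2, third),   # friend has no choice
    ]
    total = Fraction(0)
    for prize, opened, p in scenarios:
        other = 6 - 1 - opened   # the remaining unopened door (door numbers sum to 6)
        final = other if switches_when_opened(opened) else 1
        if final == prize:
            total += p
    return total

def stick(opened):    # strategy (a)
    return False

def switch(opened):   # strategy (b)
    return True

def mixed(opened):    # strategy (c): switch only if door 3 is opened
    return opened == 3

half = Fraction(1, 2)
p_stick = win_probability(stick, half)
p_switch = win_probability(switch, half)
p_mixed_always_2 = win_probability(mixed, Fraction(1))
p_mixed_equally_likely = win_probability(mixed, half)

print(p_stick, p_switch, p_mixed_always_2, p_mixed_equally_likely)
```

Strategies (a) and (b) do not depend on `q_open_2`, whereas strategy (c) does, which is the reason the text says there is insufficient information to evaluate (c) without specifying the friend's behavior.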


1.4 TOTAL PROBABILITY THEOREM AND BAYES’ RULE 


In this section, we explore some applications of conditional probability. We start
with the following theorem, which is often useful for computing the probabilities
of various events, using a "divide-and-conquer" approach.


Total Probability Theorem 


Let $A_1, \ldots, A_n$ be disjoint events that form a partition of the sample space
(each possible outcome is included in exactly one of the events $A_1, \ldots, A_n$)
and assume that $P(A_i) > 0$, for all i. Then, for any event B, we have

$$
\begin{aligned}
P(B) &= P(A_1 \cap B) + \cdots + P(A_n \cap B)\\
&= P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n).
\end{aligned}
$$





The theorem is visualized and proved in Fig. 1.13. Intuitively, we are
partitioning the sample space into a number of scenarios (events) $A_i$. Then, the
probability that B occurs is a weighted average of its conditional probability
under each scenario, where each scenario is weighted according to its
(unconditional) probability. One of the uses of the theorem is to compute the probability
of various events B for which the conditional probabilities $P(B \mid A_i)$ are known or
easy to derive. The key is to choose appropriately the partition $A_1, \ldots, A_n$, and
this choice is often suggested by the problem structure. Here are some examples.








Figure 1.13: Visualization and verification of the total probability theorem. The
events $A_1, \ldots, A_n$ form a partition of the sample space, so the event B can be
decomposed into the disjoint union of its intersections $A_i \cap B$ with the sets $A_i$,
i.e.,

$$B = (A_1 \cap B) \cup \cdots \cup (A_n \cap B).$$

Using the additivity axiom, it follows that

$$P(B) = P(A_1 \cap B) + \cdots + P(A_n \cap B).$$

Since, by the definition of conditional probability, we have

$$P(A_i \cap B) = P(A_i)P(B \mid A_i),$$

the preceding equality yields

$$P(B) = P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n).$$

For an alternative view, consider an equivalent sequential model, as shown
on the right. The probability of the leaf $A_i \cap B$ is the product $P(A_i)P(B \mid A_i)$ of
the probabilities along the path leading to that leaf. The event B consists of the
three highlighted leaves and $P(B)$ is obtained by adding their probabilities.


Example 1.13. You enter a chess tournament where your probability of winning
a game is 0.3 against half the players (call them type 1), 0.4 against a quarter
of the players (call them type 2), and 0.5 against the remaining quarter of the
players (call them type 3). You play a game against a randomly chosen opponent.
What is the probability of winning?

Let $A_i$ be the event of playing with an opponent of type $i$. We have

$$P(A_1) = 0.5, \qquad P(A_2) = 0.25, \qquad P(A_3) = 0.25.$$

Also, let $B$ be the event of winning. We have

$$P(B \mid A_1) = 0.3, \qquad P(B \mid A_2) = 0.4, \qquad P(B \mid A_3) = 0.5.$$

Thus, by the total probability theorem, the probability of winning is

$$
\begin{aligned}
P(B) &= P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3) \\
     &= 0.5 \cdot 0.3 + 0.25 \cdot 0.4 + 0.25 \cdot 0.5 \\
     &= 0.375.
\end{aligned}
$$
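The weighted-average reading of the total probability theorem can be checked numerically. The following is a minimal sketch, not part of the original text; the function name `total_probability` is an illustrative choice, and the inputs are the numbers of Example 1.13.

```python
def total_probability(priors, likelihoods):
    """Return P(B) = sum_i P(A_i) * P(B | A_i) for a partition A_1, ..., A_n."""
    # The A_i form a partition, so their probabilities must add to 1.
    if abs(sum(priors) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1")
    return sum(p * l for p, l in zip(priors, likelihoods))


# Example 1.13: opponent types 1, 2, 3.
p_win = total_probability([0.5, 0.25, 0.25], [0.3, 0.4, 0.5])
print(round(p_win, 3))
```

The result agrees with the value 0.375 obtained by hand.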


Example 1.14. You roll a fair four-sided die. If the result is 1 or 2, you roll
once more; otherwise, you stop. What is the probability that the sum total of
your rolls is at least 4?

Let $A_i$ be the event that the result of the first roll is $i$, and note that
$P(A_i) = 1/4$ for each $i$. Let $B$ be the event that the sum total is at least
4. Given the event $A_1$, the sum total will be at least 4 if the second roll
results in 3 or 4, which happens with probability 1/2. Similarly, given the
event $A_2$, the sum total will be at least 4 if the second roll results in 2,
3, or 4, which happens with probability 3/4. Also, given the event $A_3$, you
stop and the sum total remains below 4, whereas given the event $A_4$, you stop
with a sum total of exactly 4. Therefore,

$$P(B \mid A_1) = \frac{1}{2}, \qquad P(B \mid A_2) = \frac{3}{4}, \qquad P(B \mid A_3) = 0, \qquad P(B \mid A_4) = 1.$$

By the total probability theorem,

$$P(B) = \frac{1}{4} \cdot \frac{1}{2} + \frac{1}{4} \cdot \frac{3}{4} + \frac{1}{4} \cdot 0 + \frac{1}{4} \cdot 1 = \frac{9}{16}.$$

The total probability theorem can be applied repeatedly to calculate
probabilities in experiments that have a sequential character, as shown in the
following example.


Example 1.15. Alice is taking a probability class and at the end of each week 
she can be either up-to-date or she may have fallen behind. If she is up-to-date in 
a given week, the probability that she will be up-to-date (or behind) in the next 
week is 0.8 (or 0.2, respectively). If she is behind in a given week, the probability 
that she will be up-to-date (or behind) in the next week is 0.4 (or 0.6, respectively). 
Alice is (by default) up-to-date when she starts the class. What is the probability 
that she is up-to-date after three weeks? 

Let $U_i$ and $B_i$ be the events that Alice is up-to-date or behind,
respectively, after $i$ weeks. According to the total probability theorem, the
desired probability $P(U_3)$ is given by

$$P(U_3) = P(U_2)P(U_3 \mid U_2) + P(B_2)P(U_3 \mid B_2) = P(U_2) \cdot 0.8 + P(B_2) \cdot 0.4.$$

The probabilities $P(U_2)$ and $P(B_2)$ can also be calculated using the total
probability theorem:

$$P(U_2) = P(U_1)P(U_2 \mid U_1) + P(B_1)P(U_2 \mid B_1) = P(U_1) \cdot 0.8 + P(B_1) \cdot 0.4,$$

$$P(B_2) = P(U_1)P(B_2 \mid U_1) + P(B_1)P(B_2 \mid B_1) = P(U_1) \cdot 0.2 + P(B_1) \cdot 0.6.$$

Finally, since Alice starts her class up-to-date, we have

$$P(U_1) = 0.8, \qquad P(B_1) = 0.2.$$

We can now combine the preceding three equations to obtain

$$P(U_2) = 0.8 \cdot 0.8 + 0.2 \cdot 0.4 = 0.72,$$

$$P(B_2) = 0.8 \cdot 0.2 + 0.2 \cdot 0.6 = 0.28,$$

and by using the above probabilities in the formula for $P(U_3)$:

$$P(U_3) = 0.72 \cdot 0.8 + 0.28 \cdot 0.4 = 0.688.$$


Note that we could have calculated the desired probability $P(U_3)$ by
constructing a tree description of the experiment, then calculating the
probability of every element of $U_3$ using the multiplication rule on the tree,
and adding. However, there are cases where the calculation based on the total
probability theorem is more convenient. For example, suppose we are interested
in the probability $P(U_{20})$ that Alice is up-to-date after 20 weeks.
Calculating this probability using the multiplication rule is very cumbersome,
because the tree representing the experiment is 20 stages deep and has $2^{20}$
leaves. On the other hand, with a computer, a sequential calculation using the
total probability formulas

$$P(U_{i+1}) = P(U_i) \cdot 0.8 + P(B_i) \cdot 0.4,$$

$$P(B_{i+1}) = P(U_i) \cdot 0.2 + P(B_i) \cdot 0.6,$$

and the initial conditions $P(U_1) = 0.8$, $P(B_1) = 0.2$, is very simple.
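The sequential calculation described above takes only a few lines. This is a minimal sketch under the transition probabilities of Example 1.15 (0.8 and 0.2 from up-to-date, 0.4 and 0.6 from behind); the function name `up_to_date_probability` is an assumption made for illustration.

```python
def up_to_date_probability(weeks):
    """Return P(U_weeks) by applying the total probability theorem week by week."""
    if weeks < 1:
        raise ValueError("weeks must be at least 1")
    u, b = 0.8, 0.2  # P(U_1), P(B_1): Alice starts the class up-to-date
    for _ in range(weeks - 1):
        # Both updates use the previous week's values, hence the tuple assignment.
        u, b = u * 0.8 + b * 0.4, u * 0.2 + b * 0.6
    return u


print(round(up_to_date_probability(3), 3))
print(round(up_to_date_probability(20), 6))
```

Each week costs two multiplications and one addition per quantity, so 20 weeks need 19 update steps instead of a tree with $2^{20}$ leaves. For three weeks the loop reproduces 0.688; for 20 weeks the value is very close to 2/3.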


Inference and Bayes’ Rule 


The total probability theorem is often used in conjunction with the following 
celebrated theorem, which relates conditional probabilities of the form P(A | B) 
with conditional probabilities of the form P(B | A), in which the order of the 
conditioning is reversed. 


Bayes' Rule 


Let $A_1, A_2, \ldots, A_n$ be disjoint events that form a partition of the
sample space, and assume that $P(A_i) > 0$, for all $i$. Then, for any event $B$
such that $P(B) > 0$, we have

$$
\begin{aligned}
P(A_i \mid B) &= \frac{P(A_i)P(B \mid A_i)}{P(B)} \\
              &= \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + \cdots + P(A_n)P(B \mid A_n)}.
\end{aligned}
$$










Figure 1.14: An example of the inference context that is implicit in Bayes'
rule. We observe a shade in a person's X-ray (this is event $B$, the "effect")
and we want to estimate the likelihood of three mutually exclusive and
collectively exhaustive potential causes: cause 1 (event $A_1$) is that there is
a malignant tumor, cause 2 (event $A_2$) is that there is a nonmalignant tumor,
and cause 3 (event $A_3$) corresponds to reasons other than a tumor. We assume
that we know the probabilities $P(A_i)$ and $P(B \mid A_i)$, $i = 1, 2, 3$.
Given that we see a shade (event $B$ occurs), Bayes' rule gives the posterior
probabilities of the various causes as

$$P(A_i \mid B) = \frac{P(A_i)P(B \mid A_i)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3)}, \qquad i = 1, 2, 3.$$

For an alternative view, consider an equivalent sequential model, as shown
on the right. The probability $P(A_1 \mid B)$ of a malignant tumor is the
probability of the first highlighted leaf, which is $P(A_1 \cap B)$, divided by
the total probability of the highlighted leaves, which is $P(B)$.


To verify Bayes' rule, note that by the definition of conditional probability,
we have

$$P(A_i \cap B) = P(A_i)P(B \mid A_i) = P(A_i \mid B)P(B).$$

This yields the first equality. The second equality follows from the first by
using the total probability theorem to rewrite $P(B)$.

Bayes' rule is often used for inference. There are a number of “causes”
that may result in a certain “effect.” We observe the effect, and we wish to
infer the cause. The events $A_1, \ldots, A_n$ are associated with the causes and
the event $B$ represents the effect. The probability $P(B \mid A_i)$ that the
effect will be observed when the cause $A_i$ is present amounts to a
probabilistic model of the cause-effect relation (cf. Fig. 1.14). Given that the
effect $B$ has been observed, we wish to evaluate the probability
$P(A_i \mid B)$ that the cause $A_i$ is present. We refer to $P(A_i \mid B)$ as
the posterior probability of event $A_i$ given the information, to be
distinguished from $P(A_i)$, which we call the prior probability.


Example 1.16. Let us return to the radar detection problem of Example 1.9 and
Fig. 1.9. Let

$$A = \{\text{an aircraft is present}\},$$

$$B = \{\text{the radar generates an alarm}\}.$$

We are given that

$$P(A) = 0.05, \qquad P(B \mid A) = 0.99, \qquad P(B \mid A^c) = 0.1.$$

Applying Bayes' rule, with $A_1 = A$ and $A_2 = A^c$, we obtain

$$
\begin{aligned}
P(\text{aircraft present} \mid \text{alarm}) &= P(A \mid B) \\
&= \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)} \\
&= \frac{0.05 \cdot 0.99}{0.05 \cdot 0.99 + 0.95 \cdot 0.1} \\
&\approx 0.3426.
\end{aligned}
$$


Example 1.17. Let us return to the chess problem of Example 1.13. Here, $A_i$
is the event of getting an opponent of type $i$, and

$$P(A_1) = 0.5, \qquad P(A_2) = 0.25, \qquad P(A_3) = 0.25.$$

Also, $B$ is the event of winning, and

$$P(B \mid A_1) = 0.3, \qquad P(B \mid A_2) = 0.4, \qquad P(B \mid A_3) = 0.5.$$

Suppose that you win. What is the probability $P(A_1 \mid B)$ that you had an
opponent of type 1?

Using Bayes' rule, we have

$$
\begin{aligned}
P(A_1 \mid B) &= \frac{P(A_1)P(B \mid A_1)}{P(A_1)P(B \mid A_1) + P(A_2)P(B \mid A_2) + P(A_3)P(B \mid A_3)} \\
&= \frac{0.5 \cdot 0.3}{0.5 \cdot 0.3 + 0.25 \cdot 0.4 + 0.25 \cdot 0.5} \\
&= 0.4.
\end{aligned}
$$
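Bayes' rule amounts to normalizing the products $P(A_i)P(B \mid A_i)$ by their sum. The sketch below illustrates this on the chess and radar examples; the helper name `posterior` is hypothetical and not from the original text.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: return the list of P(A_i | B) given P(A_i) and P(B | A_i)."""
    joint = [p * l for p, l in zip(priors, likelihoods)]  # P(A_i and B)
    p_b = sum(joint)  # P(B), by the total probability theorem
    if p_b == 0:
        raise ValueError("P(B) must be positive")
    return [j / p_b for j in joint]


chess = posterior([0.5, 0.25, 0.25], [0.3, 0.4, 0.5])  # Example 1.17
radar = posterior([0.05, 0.95], [0.99, 0.10])          # Example 1.16
print([round(x, 4) for x in chess])
print(round(radar[0], 4))
```

The first entry of `chess` is the probability 0.4 of having faced a type 1 opponent, and `radar[0]` is the probability of about 0.3426 that an aircraft is present given an alarm. By construction, the posteriors returned for a partition add to 1.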


Example 1.18. The False-Positive Puzzle. A test for a certain rare disease is
assumed to be correct 95% of the time: if a person has the disease, the test
results are positive with probability 0.95, and if the person does not have the
disease, the test results are negative with probability 0.95. A random person
drawn from a certain population has probability 0.001 of having the disease.
Given that the person just tested positive, what is the probability of having
the disease?

If $A$ is the event that the person has the disease, and $B$ is the event that
the test results are positive, the desired probability, $P(A \mid B)$, is

$$
\begin{aligned}
P(A \mid B) &= \frac{P(A)P(B \mid A)}{P(A)P(B \mid A) + P(A^c)P(B \mid A^c)} \\
&= \frac{0.001 \cdot 0.95}{0.001 \cdot 0.95 + 0.999 \cdot 0.05} \\
&\approx 0.0187.
\end{aligned}
$$

Note that even though the test was assumed to be fairly accurate, a person who
has tested positive is still very unlikely (less than 2%) to have the disease.
According to The Economist (February 20th, 1999), 80% of those questioned at a
leading American hospital substantially missed the correct answer to a question
of this type; most of them thought that the probability that the person has the
disease is 0.95!
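One way to make the answer of Example 1.18 less surprising is to count people instead of multiplying probabilities. The sketch below assumes a hypothetical population of 100,000 (a size chosen only so that all counts are whole numbers) and applies the rates stated in the example.

```python
population = 100_000                          # hypothetical population size
sick = population // 1000                     # prevalence 0.001 -> 100 people
healthy = population - sick                   # 99,900 people

true_positives = sick * 95 // 100             # 95% of the sick test positive
false_positives = healthy * 5 // 100          # 5% of the healthy test positive

# Among everyone who tests positive, the fraction that is actually sick.
share_sick = true_positives / (true_positives + false_positives)
print(true_positives, false_positives, round(share_sick, 4))
```

In this illustration the false positives (4,995) outnumber the true positives (95) by more than fifty to one, which is why the posterior probability stays below 2% even though the test is right 95% of the time.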


1.5 INDEPENDENCE 


We have introduced the conditional probability P(A | B) to capture the partial 
information that event B provides about event A. An interesting and important 
special case arises when the occurrence of B provides no such information and 
does not alter the probability that A has occurred, i.e., 


P(A| B) = P(A). 


When the above equality holds, we say that $A$ is independent of $B$. Note that
by the definition $P(A \mid B) = P(A \cap B)/P(B)$, this is equivalent to

$$P(A \cap B) = P(A)P(B).$$

We adopt this latter relation as the definition of independence because it can
be used even when $P(B) = 0$, in which case $P(A \mid B)$ is undefined. The
symmetry of this relation also implies that independence is a symmetric
property; that is, if $A$ is independent of $B$, then $B$ is independent of $A$,
and we can unambiguously say that $A$ and $B$ are independent events.

Independence is often easy to grasp intuitively. For example, if the occurrence
of two events is governed by distinct and noninteracting physical processes,
such events will turn out to be independent. On the other hand, independence
is not easily visualized in terms of the sample space. A common first thought
is that two events are independent if they are disjoint, but in fact the
opposite is true: two disjoint events $A$ and $B$ with $P(A) > 0$ and
$P(B) > 0$ are never independent, since their intersection $A \cap B$ is empty
and has probability 0, whereas $P(A)P(B) > 0$. For example, an event $A$ and its
complement $A^c$ are not independent [unless $P(A) = 0$ or $P(A) = 1$], since
knowledge that $A$ has occurred provides precise information about whether $A^c$
has occurred.


Example 1.19. Consider an experiment involving two successive rolls of a 4-sided 
die in which all 16 possible outcomes are equally likely and have probability 1/16. 


(a) Are the events

$$A_i = \{\text{1st roll results in } i\}, \qquad B_j = \{\text{2nd roll results in } j\},$$

independent? We have

$$P(A_i \cap B_j) = P\big(\text{the outcome of the two rolls is } (i, j)\big) = \frac{1}{16},$$

$$P(A_i) = \frac{\text{number of elements of } A_i}{\text{total number of possible outcomes}} = \frac{4}{16},$$

$$P(B_j) = \frac{\text{number of elements of } B_j}{\text{total number of possible outcomes}} = \frac{4}{16}.$$

We observe that $P(A_i \cap B_j) = P(A_i)P(B_j)$, and the independence of $A_i$
and $B_j$ is verified. Thus, our choice of the discrete uniform probability law
implies the independence of the two rolls.


(b) Are the events

$$A = \{\text{1st roll is a 1}\}, \qquad B = \{\text{sum of the two rolls is a 5}\},$$

independent? The answer here is not quite obvious. We have

$$P(A \cap B) = P\big(\text{the result of the two rolls is } (1, 4)\big) = \frac{1}{16},$$

and also

$$P(A) = \frac{\text{number of elements of } A}{\text{total number of possible outcomes}} = \frac{4}{16}.$$

The event $B$ consists of the outcomes (1,4), (2,3), (3,2), and (4,1), and

$$P(B) = \frac{\text{number of elements of } B}{\text{total number of possible outcomes}} = \frac{4}{16}.$$

Thus, we see that $P(A \cap B) = P(A)P(B)$, and the events $A$ and $B$ are
independent.


(c) Are the events

$$A = \{\text{maximum of the two rolls is 2}\}, \qquad B = \{\text{minimum of the two rolls is 2}\},$$

independent? Intuitively, the answer is "no" because the minimum of the
two rolls conveys some information about the maximum. For example, if the
minimum is 2, the maximum cannot be 1. More precisely, to verify that $A$
and $B$ are not independent, we calculate

$$P(A \cap B) = P\big(\text{the result of the two rolls is } (2, 2)\big) = \frac{1}{16},$$

and also

$$P(A) = \frac{\text{number of elements of } A}{\text{total number of possible outcomes}} = \frac{3}{16},$$

$$P(B) = \frac{\text{number of elements of } B}{\text{total number of possible outcomes}} = \frac{5}{16}.$$

We have $P(A)P(B) = 15/(16)^2$, so that $P(A \cap B) \neq P(A)P(B)$, and $A$ and
$B$ are not independent.
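With only 16 equally likely outcomes, the three parts of Example 1.19 can be verified by brute-force enumeration. The following sketch is an added illustration; the helper names `prob` and `independent` are assumptions, and exact fractions are used so that the equality test is not affected by rounding.

```python
from fractions import Fraction
from itertools import product

# All 16 equally likely outcomes (first roll, second roll) of a 4-sided die.
outcomes = list(product(range(1, 5), repeat=2))


def prob(event):
    """Discrete uniform law: (number of outcomes in the event) / 16."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))


def independent(a, b):
    """Check the defining relation P(A and B) == P(A) * P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


# (a) first roll is i, second roll is j, for every i and j
part_a = all(
    independent(lambda w, i=i: w[0] == i, lambda w, j=j: w[1] == j)
    for i in range(1, 5)
    for j in range(1, 5)
)
# (b) first roll is 1, sum is 5
part_b = independent(lambda w: w[0] == 1, lambda w: sum(w) == 5)
# (c) maximum is 2, minimum is 2
part_c = independent(lambda w: max(w) == 2, lambda w: min(w) == 2)

print(part_a, part_b, part_c)
```

The checks come out independent for parts (a) and (b) and not independent for part (c), matching the calculations in the example.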


We finally note that, as mentioned earlier, if $A$ and $B$ are independent, the
occurrence of $B$ does not provide any new information on the probability of $A$
occurring. It is then intuitive that the non-occurrence of $B$ should also
provide no information on the probability of $A$. Indeed, it can be verified
that if $A$ and $B$ are independent, the same holds true for $A$ and $B^c$ (see
the end-of-chapter problems).


Conditional Independence 


We noted earlier that the conditional probabilities of events, conditioned on
a particular event, form a legitimate probability law. We can thus talk about
independence of various events with respect to this conditional law. In
particular, given an event $C$, the events $A$ and $B$ are called conditionally
independent if

$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C).$$

To derive an alternative characterization of conditional independence, we use
the definition of the conditional probability and the multiplication rule, to
write

$$
\begin{aligned}
P(A \cap B \mid C) &= \frac{P(A \cap B \cap C)}{P(C)} \\
&= \frac{P(C)P(B \mid C)P(A \mid B \cap C)}{P(C)} \\
&= P(B \mid C)P(A \mid B \cap C).
\end{aligned}
$$

We now compare the preceding two expressions, and after eliminating the common
factor $P(B \mid C)$, assumed nonzero, we see that conditional independence is
the same as the condition

$$P(A \mid B \cap C) = P(A \mid C).$$

In words, this relation states that if $C$ is known to have occurred, the
additional knowledge that $B$ also occurred does not change the probability of
$A$.

Interestingly, independence of two events $A$ and $B$ with respect to the
unconditional probability law does not imply conditional independence, and
vice versa, as illustrated by the next two examples.


Example 1.20. Consider two independent fair coin tosses, in which all four
possible outcomes are equally likely. Let

$$H_1 = \{\text{1st toss is a head}\},$$

$$H_2 = \{\text{2nd toss is a head}\},$$

$$D = \{\text{the two tosses have different results}\}.$$

The events $H_1$ and $H_2$ are (unconditionally) independent. But

$$P(H_1 \mid D) = \frac{1}{2}, \qquad P(H_2 \mid D) = \frac{1}{2}, \qquad P(H_1 \cap H_2 \mid D) = 0,$$

so that $P(H_1 \cap H_2 \mid D) \neq P(H_1 \mid D)P(H_2 \mid D)$, and $H_1$, $H_2$
are not conditionally independent.

This example can be generalized. For any probabilistic model, let $A$ and $B$ be
independent events, and let $C$ be an event such that $P(C) > 0$,
$P(A \mid C) > 0$, and $P(B \mid C) > 0$, while $A \cap B \cap C$ is empty. Then,
$A$ and $B$ cannot be conditionally independent (given $C$) since
$P(A \cap B \mid C) = 0$ while $P(A \mid C)P(B \mid C) > 0$.


Example 1.21. There are two coins, a blue and a red one. We choose one of 
the two at random, each being chosen with probability 1/2, and proceed with two 
independent tosses. The coins are biased: with the blue coin, the probability of 
heads in any given toss is 0.99, whereas for the red coin it is 0.01. 

Let $B$ be the event that the blue coin was selected. Let also $H_i$ be the
event that the $i$th toss resulted in heads. Given the choice of a coin, the
events $H_1$ and $H_2$ are independent, because of our assumption of independent
tosses. Thus,

$$P(H_1 \cap H_2 \mid B) = P(H_1 \mid B)P(H_2 \mid B) = 0.99 \cdot 0.99.$$

On the other hand, the events $H_1$ and $H_2$ are not independent. Intuitively,
if we are told that the first toss resulted in heads, this leads us to suspect
that the blue coin was selected, in which case, we expect the second toss to
also result in heads. Mathematically, we use the total probability theorem to
obtain

$$P(H_1) = P(B)P(H_1 \mid B) + P(B^c)P(H_1 \mid B^c) = \frac{1}{2} \cdot 0.99 + \frac{1}{2} \cdot 0.01 = \frac{1}{2},$$

as should be expected from symmetry considerations. Similarly, we have
$P(H_2) = 1/2$. Now notice that

$$
\begin{aligned}
P(H_1 \cap H_2) &= P(B)P(H_1 \cap H_2 \mid B) + P(B^c)P(H_1 \cap H_2 \mid B^c) \\
&= \frac{1}{2} \cdot 0.99 \cdot 0.99 + \frac{1}{2} \cdot 0.01 \cdot 0.01 \neq \frac{1}{4}.
\end{aligned}
$$

Thus, $P(H_1 \cap H_2) \neq P(H_1)P(H_2)$, and the events $H_1$ and $H_2$ are
dependent, even though they are conditionally independent given $B$.
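The gap between $P(H_1 \cap H_2)$ and $P(H_1)P(H_2)$ in Example 1.21 can be made explicit with exact arithmetic. This is a small sketch added for illustration, with variable names chosen here; it conditions on the coin and applies the total probability theorem.

```python
from fractions import Fraction

half = Fraction(1, 2)  # each coin is selected with probability 1/2
heads_given_coin = {"blue": Fraction(99, 100), "red": Fraction(1, 100)}

# Total probability theorem, conditioning on which coin was selected.
p_h1 = sum(half * q for q in heads_given_coin.values())
# Given the coin, the two tosses are independent, so P(H1 and H2 | coin) = q * q.
p_h1_and_h2 = sum(half * q * q for q in heads_given_coin.values())

print(p_h1, p_h1_and_h2, p_h1 * p_h1)
```

The joint probability is 4901/10000, roughly twice the product $P(H_1)P(H_2) = 1/4$, which quantifies how strongly the first toss informs the second through the identity of the coin.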


We now summarize. 


Independence 


• Two events $A$ and $B$ are said to be independent if

$$P(A \cap B) = P(A)P(B).$$

If in addition, $P(B) > 0$, independence is equivalent to the condition

$$P(A \mid B) = P(A).$$

• If $A$ and $B$ are independent, so are $A$ and $B^c$.

• Two events $A$ and $B$ are said to be conditionally independent,
given another event $C$ with $P(C) > 0$, if

$$P(A \cap B \mid C) = P(A \mid C)P(B \mid C).$$

If in addition, $P(B \cap C) > 0$, conditional independence is equivalent
to the condition

$$P(A \mid B \cap C) = P(A \mid C).$$

• Independence does not imply conditional independence, and vice versa.





Independence of a Collection of Events 


The definition of independence can be extended to multiple events. 


Definition of Independence of Several Events 


We say that the events $A_1, A_2, \ldots, A_n$ are independent if

$$P\Big(\bigcap_{i \in S} A_i\Big) = \prod_{i \in S} P(A_i), \qquad \text{for every subset } S \text{ of } \{1, 2, \ldots, n\}.$$







For the case of three events, $A_1$, $A_2$, and $A_3$, independence amounts to
satisfying the four conditions

$$
\begin{aligned}
P(A_1 \cap A_2) &= P(A_1)P(A_2), \\
P(A_1 \cap A_3) &= P(A_1)P(A_3), \\
P(A_2 \cap A_3) &= P(A_2)P(A_3), \\
P(A_1 \cap A_2 \cap A_3) &= P(A_1)P(A_2)P(A_3).
\end{aligned}
$$

The first three conditions simply assert that any two events are independent,
a property known as pairwise independence. But the fourth condition is
also important and does not follow from the first three. Conversely, the fourth
condition does not imply the first three; see the two examples that follow.


Example 1.22. Pairwise Independence does not Imply Independence.
Consider two independent fair coin tosses, and the following events:

$$H_1 = \{\text{1st toss is a head}\},$$

$$H_2 = \{\text{2nd toss is a head}\},$$

$$D = \{\text{the two tosses have different results}\}.$$

The events $H_1$ and $H_2$ are independent, by definition. To see that $H_1$ and
$D$ are independent, we note that

$$P(D \mid H_1) = \frac{P(H_1 \cap D)}{P(H_1)} = \frac{1/4}{1/2} = \frac{1}{2} = P(D).$$

Similarly, $H_2$ and $D$ are independent. On the other hand, we have

$$P(H_1 \cap H_2 \cap D) = 0 \neq \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = P(H_1)P(H_2)P(D),$$

and these three events are not independent.


Example 1.23. The Equality $P(A_1 \cap A_2 \cap A_3) = P(A_1)P(A_2)P(A_3)$ is
not Enough for Independence. Consider two independent rolls of a fair six-sided
die, and the following events:

$$A = \{\text{1st roll is 1, 2, or 3}\},$$

$$B = \{\text{1st roll is 3, 4, or 5}\},$$

$$C = \{\text{the sum of the two rolls is 9}\}.$$

We have

$$P(A \cap B) = \frac{1}{6} \neq \frac{1}{2} \cdot \frac{1}{2} = P(A)P(B),$$

$$P(A \cap C) = \frac{1}{36} \neq \frac{1}{2} \cdot \frac{4}{36} = P(A)P(C),$$

$$P(B \cap C) = \frac{1}{12} \neq \frac{1}{2} \cdot \frac{4}{36} = P(B)P(C).$$

Thus the three events $A$, $B$, and $C$ are not independent, and indeed no two of
these events are independent. On the other hand, we have

$$P(A \cap B \cap C) = \frac{1}{36} = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{4}{36} = P(A)P(B)P(C).$$
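The definition of independence for several events requires the product rule for every subset of the events, and Examples 1.22 and 1.23 show two different ways in which it can fail. The sketch below, an illustration with hypothetical helper names, enumerates the equally likely outcomes of each experiment and reports the index subsets for which the product rule does not hold.

```python
from fractions import Fraction
from itertools import combinations, product


def prob(outcomes, event):
    """Discrete uniform law over a finite list of outcomes."""
    return Fraction(sum(1 for w in outcomes if event(w)), len(outcomes))


def failing_subsets(outcomes, events):
    """Return the index subsets S (size >= 2) where P(intersection) != product."""
    bad = []
    for size in range(2, len(events) + 1):
        for subset in combinations(range(len(events)), size):
            joint = prob(outcomes, lambda w: all(events[i](w) for i in subset))
            expected = Fraction(1)
            for i in subset:
                expected *= prob(outcomes, events[i])
            if joint != expected:
                bad.append(subset)
    return bad


# Example 1.22: two fair coin tosses; events H1, H2, D.
tosses = list(product("HT", repeat=2))
coin_events = [lambda w: w[0] == "H", lambda w: w[1] == "H", lambda w: w[0] != w[1]]
coin_failures = failing_subsets(tosses, coin_events)

# Example 1.23: two rolls of a fair six-sided die; events A, B, C.
rolls = list(product(range(1, 7), repeat=2))
die_events = [lambda w: w[0] in (1, 2, 3), lambda w: w[0] in (3, 4, 5), lambda w: sum(w) == 9]
die_failures = failing_subsets(rolls, die_events)

print(coin_failures)
print(die_failures)
```

For the coin tosses only the full triple fails (pairwise independence without independence), whereas for the die rolls every pair fails and only the triple satisfies the product rule. An empty list would mean that the events are independent.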


The intuition behind the independence of a collection of events is analogous
to the case of two events. Independence means that the occurrence or
non-occurrence of any number of the events from that collection carries no
information on the remaining events or their complements. For example, if the
events $A_1$, $A_2$, $A_3$, $A_4$ are independent, one obtains relations such as

$$P(A_1 \cup A_2 \mid A_3 \cap A_4) = P(A_1 \cup A_2)$$

or

$$P(A_1 \cup A_2^c \mid A_3^c \cap A_4) = P(A_1 \cup A_2^c);$$

see the end-of-chapter problems.


Reliability


In probabilistic models of complex systems involving several components, it is 
often convenient to assume that the behaviors of the components are uncoupled 
(independent). This typically simplifies the calculations and the analysis, as 
illustrated in the following example. 


Example 1.24. Network Connectivity. A computer network connects two 
nodes A and B through intermediate nodes C, D, E, F, as shown in Fig. 1.15(a). 
For every pair of directly connected nodes, say $i$ and $j$, there is a given
probability $p_{ij}$ that the link from $i$ to $j$ is up. We assume that link
failures are independent
of each other. What is the probability that there is a path connecting A and B in 
which all links are up? 




Figure 1.15: (a) Network for Example 1.24. The number next to each link 
indicates the probability that the link is up. (b) Series and parallel connections 
of three components in a reliability problem.


This is a typical problem of assessing the reliability of a system consisting of
components that can fail independently. Such a system can often be divided into
subsystems, where each subsystem consists in turn of several components that are
connected either in series or in parallel; see Fig. 1.15(b).

Let a subsystem consist of components $1, 2, \ldots, m$, and let $p_i$ be the
probability that component $i$ is up (“succeeds”). Then, a series subsystem
succeeds if all of its components are up, so its probability of success is the
product of the probabilities of success of the corresponding components, i.e.,

$$P(\text{series subsystem succeeds}) = p_1 p_2 \cdots p_m.$$

A parallel subsystem succeeds if any one of its components succeeds, so its
probability of failure is the product of the probabilities of failure of the
corresponding components, i.e.,

$$
\begin{aligned}
P(\text{parallel subsystem succeeds}) &= 1 - P(\text{parallel subsystem fails}) \\
&= 1 - (1 - p_1)(1 - p_2) \cdots (1 - p_m).
\end{aligned}
$$

Returning now to the network of Fig. 1.15(a), we can calculate the probability
of success (a path from $A$ to $B$ is available) sequentially, using the
preceding formulas, and starting from the end. Let us use the notation
$X \to Y$ to denote the event that there is a (possibly indirect) connection
from node $X$ to node $Y$. Then,

$$
\begin{aligned}
P(C \to B) &= 1 - \big(1 - P(C \to E \text{ and } E \to B)\big)\big(1 - P(C \to F \text{ and } F \to B)\big) \\
&= 1 - (1 - p_{CE}\,p_{EB})(1 - p_{CF}\,p_{FB}) \\
&= 1 - (1 - 0.8 \cdot 0.9)(1 - 0.95 \cdot 0.85) \\
&= 0.946,
\end{aligned}
$$

$$P(A \to C \text{ and } C \to B) = P(A \to C)P(C \to B) = 0.9 \cdot 0.946 = 0.851,$$

$$P(A \to D \text{ and } D \to B) = P(A \to D)P(D \to B) = 0.75 \cdot 0.95 = 0.712,$$

and finally we obtain the desired probability

$$
\begin{aligned}
P(A \to B) &= 1 - \big(1 - P(A \to C \text{ and } C \to B)\big)\big(1 - P(A \to D \text{ and } D \to B)\big) \\
&= 1 - (1 - 0.851)(1 - 0.712) \\
&= 0.957.
\end{aligned}
$$
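The series and parallel formulas compose naturally as two small functions. The following sketch is an added illustration; the link probabilities are the ones used in the worked calculation of Example 1.24, and the names `series`, `parallel`, and `link` are assumptions made here.

```python
from math import prod


def series(*probs):
    """A series subsystem is up only if every component is up."""
    return prod(probs)


def parallel(*probs):
    """A parallel subsystem is down only if every component is down."""
    return 1 - prod(1 - p for p in probs)


# Link probabilities as they appear in the calculation of Example 1.24.
link = {"AC": 0.9, "CE": 0.8, "EB": 0.9, "CF": 0.95, "FB": 0.85, "AD": 0.75, "DB": 0.95}

# Work backward from B: first C -> B, then the two routes out of A.
c_to_b = parallel(series(link["CE"], link["EB"]), series(link["CF"], link["FB"]))
a_to_b = parallel(series(link["AC"], c_to_b), series(link["AD"], link["DB"]))

print(round(c_to_b, 3), round(a_to_b, 3))
```

The code keeps full precision in the intermediate values, whereas the example rounds them to three digits; both routes give 0.957 after rounding. Note that this decomposition is valid because the network reduces to nested series and parallel blocks that share no links.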


Independent Trials and the Binomial Probabilities 


If an experiment involves a sequence of independent but identical stages, we say 
that we have a sequence of independent trials. In the special case where there 
are only two possible results at each stage, we say that we have a sequence of 
independent Bernoulli trials. The two possible results can be anything, e.g., 
“it rains” or “it doesn’t rain,” but we will often think in terms of coin tosses and 
refer to the two results as “heads” (H) and “tails” (T).

Consider an experiment that consists of $n$ independent tosses of a coin, in
which the probability of heads is $p$, where $p$ is some number between 0 and 1.
In this context, independence means that the events $A_1, A_2, \ldots, A_n$ are
independent, where $A_i = \{i\text{th toss is a head}\}$.

We can visualize independent Bernoulli trials by means of a sequential
description, as shown in Fig. 1.16 for the case where $n = 3$. The conditional
probability of any toss being a head, conditioned on the results of any
preceding tosses, is $p$ because of independence. Thus, by multiplying the
conditional probabilities along the corresponding path of the tree, we see that
any particular outcome (3-long sequence of heads and tails) that involves $k$
heads and $3 - k$ tails has probability $p^k(1 - p)^{3 - k}$. This formula
extends to the case of a general number $n$ of tosses. We obtain that the
probability of any particular $n$-long sequence that contains $k$ heads and
$n - k$ tails is $p^k(1 - p)^{n - k}$, for all $k$ from 0 to $n$.


[Tree diagram. Leaf probabilities: HHH: $p^3$; HHT, HTH, THH: $p^2(1 - p)$;
HTT, THT, TTH: $p(1 - p)^2$; TTT: $(1 - p)^3$.]





Figure 1.16: Sequential description of an experiment involving three
independent tosses of a coin. Along the branches of the tree, we record the
corresponding conditional probabilities, and by the multiplication rule, the
probability of obtaining a particular 3-toss sequence is calculated by
multiplying the probabilities recorded along the corresponding path of the tree.


Let us now consider the probability

$$p(k) = P(k \text{ heads come up in an } n\text{-toss sequence}),$$

which will play an important role later. We showed above that the probability
of any given sequence that contains $k$ heads is $p^k(1 - p)^{n - k}$, so we have

$$p(k) = \binom{n}{k} p^k (1 - p)^{n - k},$$

where we use the notation

$$\binom{n}{k} = \text{number of distinct } n\text{-toss sequences that contain } k \text{ heads}.$$

The numbers $\binom{n}{k}$ (read as “n choose k”) are known as the binomial
coefficients, while the probabilities $p(k)$ are known as the binomial
probabilities. Using a counting argument, to be given in Section 1.6, we can
show that

$$\binom{n}{k} = \frac{n!}{k!\,(n - k)!}, \qquad k = 0, 1, \ldots, n,$$

where for any positive integer $i$ we have

$$i! = 1 \cdot 2 \cdots (i - 1) \cdot i,$$

and, by convention, $0! = 1$. An alternative verification is sketched in the
end-of-chapter problems. Note that the binomial probabilities $p(k)$ must add to
1, thus showing the binomial formula

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1 - p)^{n - k} = 1.$$


Example 1.25. Grade of Service. An internet service provider has installed c 
modems to serve the needs of a population of n dialup customers. It is estimated 
that at a given time, each customer will need a connection with probability p, 
independent of the others. What is the probability that there are more customers 
needing a connection than there are modems? 

Here we are interested in the probability that more than $c$ customers
simultaneously need a connection. It is equal to

$$\sum_{k=c+1}^{n} p(k),$$

where

$$p(k) = \binom{n}{k} p^k (1 - p)^{n - k}$$


are the binomial probabilities. For instance, if n = 100, p = 0.1, and c = 15, the 
probability of interest turns out to be 0.0399. 
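The tail sum of binomial probabilities is straightforward to evaluate numerically. The sketch below is an added illustration using `math.comb` from the Python standard library (available in Python 3.8 and later); the function names are assumptions made here.

```python
from math import comb


def binomial_pmf(k, n, p):
    """p(k): probability of exactly k successes in n independent Bernoulli trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def overflow_probability(n, p, c):
    """Probability that more than c of the n customers need a connection."""
    return sum(binomial_pmf(k, n, p) for k in range(c + 1, n + 1))


p_overflow = overflow_probability(n=100, p=0.1, c=15)
print(round(p_overflow, 4))
```

The computed value agrees with the figure 0.0399 quoted in the example. To size the facility, one could increase `c` until `overflow_probability` drops below the tolerated level.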

This example is typical of problems of sizing a facility to serve the needs 
of a homogeneous population, consisting of independently acting customers. The 
problem is to select the facility size to guarantee a certain probability (sometimes 
called grade of service) that no user is left unserved.


1.6 COUNTING


The calculation of probabilities often involves counting the number of outcomes 
in various events. We have already seen two contexts where such counting arises. 


(a) When the sample space $\Omega$ has a finite number of equally likely
outcomes, so that the discrete uniform probability law applies. Then, the
probability of any event $A$ is given by

$$P(A) = \frac{\text{number of elements of } A}{\text{number of elements of } \Omega},$$

and involves counting the elements of $A$ and of $\Omega$.


(b) When we want to calculate the probability of an event A with a finite 
number of equally likely outcomes, each of which has an already known 
probability p. Then the probability of A is given by 


P(A) = p- (number of elements of A), 


and involves counting the number of elements of A. An example of this type 
is the calculation of the probability of k heads in n coin tosses (the binomial 
probabilities). We saw in the preceding section that the probability of each 
distinct sequence involving k heads is easily obtained, but the calculation 
of the number of all such sequences, to be presented shortly, requires some 
thought. 


While counting is in principle straightforward, it is frequently challenging; 
the art of counting constitutes a large portion of the field of combinatorics. In 
this section, we present the basic principle of counting and apply it to a number 
of situations that are often encountered in probabilistic models. 


The Counting Principle 


The counting principle is based on a divide-and-conquer approach, whereby the 
counting is broken down into stages through the use of a tree. For example, 
consider an experiment that consists of two consecutive stages. The possible 
results at the first stage are $a_1, a_2, \ldots, a_m$; the possible results at
the second stage are $b_1, b_2, \ldots, b_n$. Then, the possible results of the
two-stage experiment are all possible ordered pairs $(a_i, b_j)$,
$i = 1, \ldots, m$, $j = 1, \ldots, n$. Note that the number of such ordered
pairs is equal to $mn$. This observation can be generalized as follows (see also
Fig. 1.17).


Figure 1.17: Illustration of the basic counting principle. The counting is
carried out in $r$ stages ($r = 4$ in the figure). The first stage has $n_1$
possible results. For every possible result at the first $i - 1$ stages, there
are $n_i$ possible results at the $i$th stage. The number of leaves is
$n_1 n_2 \cdots n_r$. This is the desired count.


The Counting Principle 
Consider a process that consists of r stages. Suppose that: 
(a) There are $n_1$ possible results at the first stage.

(b) For every possible result at the first stage, there are $n_2$ possible
results at the second stage.

(c) More generally, for any sequence of possible results at the first $i - 1$
stages, there are $n_i$ possible results at the $i$th stage.

Then, the total number of possible results of the $r$-stage process is

$$n_1 n_2 \cdots n_r.$$


Example 1.26. The Number of Telephone Numbers. A local telephone
number is a 7-digit sequence, but the first digit has to be different from 0 or
1. How many distinct telephone numbers are there? We can visualize the choice of
a sequence as a sequential process, where we select one digit at a time. We have
a total of 7 stages, and a choice of one out of 10 elements at each stage,
except for the first stage where we only have 8 choices. Therefore, the answer
is

$$8 \cdot \underbrace{10 \cdot 10 \cdots 10}_{6 \text{ times}} = 8 \cdot 10^6.$$


Example 1.27. The Number of Subsets of an n-Element Set. Consider
an $n$-element set $\{s_1, s_2, \ldots, s_n\}$. How many subsets does it have
(including itself and the empty set)? We can visualize the choice of a subset as
a sequential process where we examine one element at a time and decide whether
to include it in the set or not. We have a total of $n$ stages, and a binary
choice at each stage. Therefore the number of subsets is

$$\underbrace{2 \cdot 2 \cdots 2}_{n \text{ times}} = 2^n.$$


It should be noted that the Counting Principle remains valid even if each 
first-stage result leads to a different set of potential second-stage results, etc. The 
only requirement is that the number of possible second-stage results is constant, 
regardless of the first-stage result. 
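The Counting Principle says that the count is the product of the stage sizes, and for small cases this can be confirmed by actually listing the results. The sketch below is an illustration added here; `count_results` and `all_subsets` are hypothetical helper names.

```python
from itertools import product
from math import prod


def count_results(stage_sizes):
    """Counting Principle: multiply the number of choices at each stage."""
    return prod(stage_sizes)


def all_subsets(items):
    """List every subset by making an include/exclude choice for each element."""
    subsets = []
    for choices in product([False, True], repeat=len(items)):
        subsets.append({x for x, keep in zip(items, choices) if keep})
    return subsets


telephone_numbers = count_results([8] + [10] * 6)  # Example 1.26
subsets_of_four = all_subsets(["a", "b", "c", "d"])  # Example 1.27 with n = 4

print(telephone_numbers, len(subsets_of_four), count_results([2] * 4))
```

Listing the subsets of a 4-element set produces 16 of them, in agreement with $2^4$, and the telephone-number count is $8 \cdot 10^6$.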

In what follows, we will focus primarily on two types of counting arguments 
that involve the selection of k objects out of a collection of n objects. If the order 
of selection matters, the selection is called a permutation, and otherwise, it is 
called a combination. We will then discuss a more general type of counting,
involving a partition of a collection of n objects into multiple subsets. 


k-permutations 


We start with $n$ distinct objects, and let $k$ be some positive integer, with
$k \le n$. We wish to count the number of different ways that we can pick $k$
out of these $n$ objects and arrange them in a sequence, i.e., the number of
distinct $k$-object sequences. We can choose any of the $n$ objects to be the
first one. Having chosen the first, there are only $n - 1$ possible choices for
the second; given the choice of the first two, there only remain $n - 2$
available objects for the third stage, etc. When we are ready to select the last
(the $k$th) object, we have already chosen $k - 1$ objects, which leaves us with
$n - (k - 1)$ choices for the last one. By the Counting Principle, the number of
possible sequences, called k-permutations, is

$$n(n - 1) \cdots (n - k + 1) = \frac{n(n - 1) \cdots (n - k + 1)(n - k) \cdots 2 \cdot 1}{(n - k) \cdots 2 \cdot 1} = \frac{n!}{(n - k)!}.$$

In the special case where $k = n$, the number of possible sequences, simply
called permutations, is

$$n(n - 1)(n - 2) \cdots 2 \cdot 1 = n!.$$


(Let $k = n$ in the formula for the number of $k$-permutations, and recall the
convention $0! = 1$.)


Example 1.28. Let us count the number of words that consist of four distinct
letters. This is the problem of counting the number of 4-permutations of the 26
letters in the alphabet. The desired number is

$$\frac{n!}{(n - k)!} = \frac{26!}{22!} = 26 \cdot 25 \cdot 24 \cdot 23 = 358{,}800.$$
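The formula $n!/(n - k)!$ can be checked against a direct enumeration for small cases and against the standard library for larger ones. This sketch is an added illustration; `k_permutations` is a name chosen here, while `math.perm` and `itertools.permutations` are standard library functions (Python 3.8 and later for `math.perm`).

```python
from itertools import permutations
from math import factorial, perm


def k_permutations(n, k):
    """Number of ordered selections of k out of n distinct objects: n! / (n - k)!."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return factorial(n) // factorial(n - k)


four_letter_words = k_permutations(26, 4)            # Example 1.28
listed = list(permutations("ABCD", 2))               # explicit 2-permutations of 4 letters

print(four_letter_words, len(listed), k_permutations(4, 2), perm(26, 4))
```

Enumerating the ordered pairs of distinct letters from A, B, C, D gives 12 sequences, which matches $4!/2!$, and the count for four-letter words with distinct letters is 358,800.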


The count for permutations can be combined with the Counting Principle 
to solve more complicated counting problems. 


Example 1.29. You have $n_1$ classical music CDs, $n_2$ rock music CDs, and $n_3$ 
country music CDs. In how many different ways can you arrange them so that the 
CDs of the same type are contiguous? 

We break down the problem in two stages, where we first select the order of 
the CD types, and then the order of the CDs of each type. There are 3! ordered 
sequences of the types of CDs (such as classical/rock/country, rock/country/classical, 
etc.), and there are $n_1!$ (or $n_2!$, or $n_3!$) permutations of the classical (or rock, or 
country, respectively) CDs. Thus for each of the 3! CD type sequences, there are 
$n_1!\, n_2!\, n_3!$ arrangements of CDs, and the desired total number is $3!\, n_1!\, n_2!\, n_3!$. 
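
As a sanity check on the two-stage argument, the count can be verified by brute force for a small collection. This is a minimal sketch under stated assumptions: each type has at least one CD, the CDs are modeled as labeled pairs, and the function names are hypothetical.

```python
from itertools import permutations
from math import factorial


def contiguous_arrangements(n1: int, n2: int, n3: int) -> int:
    """Count orderings of labeled CDs in which each type forms a single block.

    Assumes n1, n2, n3 >= 1 and small enough to enumerate all orderings.
    """
    cds = [("classical", i) for i in range(n1)]
    cds += [("rock", i) for i in range(n2)]
    cds += [("country", i) for i in range(n3)]
    count = 0
    for order in permutations(cds):
        types = [cd_type for cd_type, _ in order]
        # With three nonempty types, the types are contiguous exactly when
        # the type changes twice along the shelf (three blocks in total).
        changes = sum(1 for a, b in zip(types, types[1:]) if a != b)
        if changes == 2:
            count += 1
    return count


def contiguous_formula(n1: int, n2: int, n3: int) -> int:
    """The closed-form count 3! n1! n2! n3! from the two-stage argument."""
    return factorial(3) * factorial(n1) * factorial(n2) * factorial(n3)
```

For instance, `contiguous_arrangements(1, 2, 3)` examines all 6! orderings of six CDs and should agree with `contiguous_formula(1, 2, 3)`.
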

Suppose now that you offer to give $k_i$ out of the $n_i$ CDs of each type $i$ to a 
friend, where $k_i < n_i$, $i = 1, 2, 3$. What is the number of all possible arrangements 
of the CDs that you are left with? The solution is similar, except that the number of 
$(n_i - k_i)$-permutations of CDs of type $i$, namely $n_i!/k_i!$, replaces $n_i!$ in the 
preceding count, so the number of possible arrangements is 

$$3!\cdot\frac{n_1!}{k_1!}\cdot\frac{n_2!}{k_2!}\cdot\frac{n_3!}{k_3!}.$$


Combinations 


There are n people and we are interested in forming a committee of k. How 
many different committees are possible? More abstractly, this is the same as the 
problem of counting the number of k-element subsets of a given n-element set. 
Notice that forming a combination is different from forming a k-permutation, 
because in a combination there is no ordering of the selected elements. 
For example, whereas the 2-permutations of the letters A, B, C, and D are 

AB, BA, AC, CA, AD, DA, BC, CB, BD, DB, CD, DC, 

the combinations of two out of these four letters are 

AB, AC, AD, BC, BD, CD. 


In the preceding example, the combinations are obtained from the per- 
mutations by grouping together "duplicates"; for example, AB and BA are not 
viewed as distinct, and are both associated with the combination AB. This reasoning 
can be generalized: each combination is associated with k! "duplicate" 
k-permutations, so the number $n!/(n - k)!$ of k-permutations is equal to the 
number of combinations times k!. Hence, the number of possible combinations 
is equal to 

$$\frac{n!}{k!\,(n-k)!}.$$
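
The grouping of "duplicate" permutations can be made concrete with a short Python sketch. It is an illustration only, with hypothetical variable names: it lists the 2-permutations of A, B, C, D, groups them by the set of letters they use, and compares the number of groups with the formula.

```python
from itertools import combinations, permutations
from math import factorial


def count_combinations(n: int, k: int) -> int:
    """Number of k-element subsets of an n-element set: n!/(k!(n-k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))


letters = "ABCD"
perms = ["".join(p) for p in permutations(letters, 2)]
combs = ["".join(c) for c in combinations(letters, 2)]

# Group the 2-permutations by their letters in sorted order, so that
# AB and BA fall into the same group, labeled "AB".
groups = {}
for p in perms:
    groups.setdefault("".join(sorted(p)), []).append(p)

if __name__ == "__main__":
    print(len(perms), len(combs), len(groups))
    print(groups)
```

Each group contains k! = 2 permutations, which is exactly the factor by which the permutation count exceeds the combination count.
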

Let us now relate the above expression to the binomial coefficient, which 
was denoted by $\binom{n}{k}$ and was defined in the preceding section as the number of 
n-toss sequences with k heads. We note that specifying an n-toss sequence with 
k heads is the same as selecting k elements (those that correspond to heads) out 
of the n-element set of tosses, i.e., a combination of k out of n objects. Hence, 
the binomial coefficient is also given by the same formula and we have 

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!}.$$

Example 1.30. The number of combinations of two out of the four letters A, B, 
C, and D is found by letting n = 4 and k = 2. It is 

$$\binom{4}{2} = \frac{4!}{2!\,2!} = 6,$$


consistent with the listing given earlier. 


It is worth observing that counting arguments sometimes lead to formulas 
that are rather difficult to derive algebraically. One example is the binomial 
formula 

$$\sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1,$$

discussed in Section 1.5. In the special case where p = 1/2, this formula becomes 

$$\sum_{k=0}^{n} \binom{n}{k} = 2^n,$$

and admits the following simple interpretation. Since $\binom{n}{k}$ is the number of 
k-element subsets of a given n-element set, the sum over k of $\binom{n}{k}$ counts the 
number of subsets of all possible cardinalities. It is therefore equal to the number 
of all subsets of an n-element set, which is $2^n$. 
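
The subset-counting interpretation is easy to test for small n. The sketch below (an illustration, with an assumed helper name) counts the k-element subsets for each k by enumeration and compares the total with 2 to the power n.

```python
from itertools import combinations
from math import comb


def subsets_by_size(n: int) -> list:
    """Return the number of k-element subsets of {0, ..., n-1} for k = 0..n."""
    items = range(n)
    return [sum(1 for _ in combinations(items, k)) for k in range(n + 1)]


if __name__ == "__main__":
    for n in range(6):
        counts = subsets_by_size(n)
        # Each entry should equal the binomial coefficient comb(n, k),
        # and the entries should add up to 2**n.
        print(n, counts, sum(counts), 2 ** n)
```
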


Example 1.31. We have a group of n persons. Consider clubs that consist of a 
special person from the group (the club leader) and a number (possibly zero) of 
additional club members. Let us count the number of possible clubs of this type in 
two different ways, thereby obtaining an algebraic identity. 

There are n choices for club leader. Once the leader is chosen, we are left 
with a set of $n - 1$ available persons, and we are free to choose any of the $2^{n-1}$ 
subsets. Thus the number of possible clubs is $n 2^{n-1}$. 

Alternatively, for fixed k, we can form a k-person club by first selecting k out 
of the n available persons [there are $\binom{n}{k}$ choices]. We can then select one of the 
members to be the leader (there are k choices). By adding over all possible club 
sizes k, we obtain the number of possible clubs as $\sum_{k=1}^{n} k\binom{n}{k}$, thereby showing the 
identity 

$$\sum_{k=1}^{n} k \binom{n}{k} = n 2^{n-1}.$$
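
Both ways of counting clubs can be compared against a direct enumeration. The following Python sketch is illustrative only, and its function names are assumptions rather than anything defined in the text.

```python
from itertools import combinations
from math import comb


def count_clubs_by_enumeration(n: int) -> int:
    """Count (leader, club) pairs by listing every nonempty club."""
    people = range(n)
    total = 0
    for k in range(1, n + 1):
        for club in combinations(people, k):
            total += len(club)  # any of the k members may be the leader
    return total


def count_clubs_leader_first(n: int) -> int:
    """Choose the leader (n ways), then any subset of the other n-1 persons."""
    return n * 2 ** (n - 1)


def count_clubs_by_size(n: int) -> int:
    """Sum over club sizes k of k * C(n, k)."""
    return sum(k * comb(n, k) for k in range(1, n + 1))
```

All three functions should return the same value for every n of at least 1, which is the identity derived in the example.
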


Partitions 


Recall that a combination is a choice of k elements out of an n-element set 
without regard to order. Thus, a combination can be viewed as a partition of 
the set in two: one part contains k elements and the other contains the remaining 
$n - k$. We now generalize by considering partitions into more than two subsets. 

We are given an n-element set and nonnegative integers $n_1, n_2, \ldots, n_r$, 
whose sum is equal to n. We consider partitions of the set into r disjoint subsets, 
with the ith subset containing exactly $n_i$ elements. Let us count in how many 
ways this can be done. 

We form the subsets one at a time. We have $\binom{n}{n_1}$ ways of forming the 
first subset. Having formed the first subset, we are left with $n - n_1$ elements. 
We need to choose $n_2$ of them in order to form the second subset, and we have 
$\binom{n - n_1}{n_2}$ choices, etc. Using the Counting Principle for this r-stage process, the 
total number of choices is 

$$\binom{n}{n_1}\binom{n-n_1}{n_2}\binom{n-n_1-n_2}{n_3}\cdots\binom{n-n_1-\cdots-n_{r-1}}{n_r},$$

which is equal to 

$$\frac{n!}{n_1!\,(n-n_1)!}\cdot\frac{(n-n_1)!}{n_2!\,(n-n_1-n_2)!}\cdots\frac{(n-n_1-\cdots-n_{r-1})!}{(n-n_1-\cdots-n_{r-1}-n_r)!\,n_r!}.$$

We note that several terms cancel and we are left with 

$$\frac{n!}{n_1!\,n_2!\cdots n_r!}.$$

This is called the multinomial coefficient and is usually denoted by 

$$\binom{n}{n_1, n_2, \ldots, n_r}.$$
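
The cancellation argument can be mirrored in code. The sketch below, with hypothetical function names, computes the multinomial coefficient once as a product of binomial coefficients (one per stage) and once from the closed form, so that the two can be compared.

```python
from math import comb, factorial


def multinomial_sequential(sizes) -> int:
    """Multiply binomial coefficients, one per stage of the r-stage process."""
    remaining = sum(sizes)
    total = 1
    for size in sizes:
        total *= comb(remaining, size)  # choose this subset from what is left
        remaining -= size
    return total


def multinomial_closed_form(sizes) -> int:
    """Evaluate n!/(n_1! n_2! ... n_r!) directly."""
    denominator = 1
    for size in sizes:
        denominator *= factorial(size)
    return factorial(sum(sizes)) // denominator


if __name__ == "__main__":
    for sizes in [(2, 2), (3, 1, 2), (4, 4, 4, 4)]:
        print(sizes, multinomial_sequential(sizes), multinomial_closed_form(sizes))
```
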
Example 1.32. Anagrams. How many different words (letter sequences) can be 
obtained by rearranging the letters in the word TATTOO? There are six positions 
to be filled by the available letters. Each rearrangement corresponds to a partition 
of the set of the six positions into a group of size 3 (the positions that get the letter 
T), a group of size 1 (the position that gets the letter A), and a group of size 2 (the 
positions that get the letter O). Thus, the desired number is 

$$\frac{6!}{1!\,2!\,3!} = \frac{1\cdot 2\cdot 3\cdot 4\cdot 5\cdot 6}{1\cdot 1\cdot 2\cdot 1\cdot 2\cdot 3} = 60.$$


It is instructive to derive this answer using an alternative argument. (This 
argument can also be used to rederive the multinomial coefficient formula; see 
the end-of-chapter problems.) Let us write TATTOO in the form $T_1 A T_2 T_3 O_1 O_2$, 
pretending for a moment that we are dealing with 6 distinguishable objects. These 
6 objects can be rearranged in 6! different ways. However, any of the 3! possible 
permutations of $T_1$, $T_2$, and $T_3$, as well as any of the 2! possible permutations of 
$O_1$ and $O_2$, lead to the same word. Thus, when the subscripts are removed, there 
are only 6!/(3! 2!) different words. 
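
The alternative argument translates directly into a brute-force check: generate all 6! orderings of the six letter positions and discard duplicates. This is a small illustrative sketch; the helper names are assumptions.

```python
from collections import Counter
from itertools import permutations
from math import factorial


def anagrams_by_enumeration(word: str) -> int:
    """Generate all orderings of the letters and keep the distinct ones."""
    return len(set(permutations(word)))


def anagrams_by_formula(word: str) -> int:
    """Divide len(word)! by the factorial of each letter's multiplicity."""
    denominator = 1
    for multiplicity in Counter(word).values():
        denominator *= factorial(multiplicity)
    return factorial(len(word)) // denominator


if __name__ == "__main__":
    print(anagrams_by_enumeration("TATTOO"), anagrams_by_formula("TATTOO"))
```
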


Example 1.33. A class consisting of 4 graduate and 12 undergraduate students 
is randomly divided into four groups of 4. What is the probability that each group 
includes a graduate student? This is the same as Example 1.11 in Section 1.3, but 
we will now obtain the answer using a counting argument. 

We first determine the nature of the sample space. A typical outcome is a 
particular way of partitioning the 16 students into four groups of 4. We take the 
term “randomly” to mean that every possible partition is equally likely, so that the 
probability question can be reduced to one of counting. 

According to our earlier discussion, there are 

$$\binom{16}{4, 4, 4, 4} = \frac{16!}{4!\,4!\,4!\,4!}$$

different partitions, and this is the size of the sample space. 
Let us now focus on the event that each group contains a graduate student. 
Generating an outcome with this property can be accomplished in two stages: 


(a) Take the four graduate students and distribute them to the four groups; there 
are four choices for the group of the first graduate student, three choices for 
the second, two for the third. Thus, there is a total of 4! choices for this stage. 


(b) Take the remaining 12 undergraduate students and distribute them to the 
four groups (3 students in each). This can be done in 

$$\binom{12}{3, 3, 3, 3} = \frac{12!}{3!\,3!\,3!\,3!}$$

different ways. 

By the Counting Principle, the event of interest can occur in 

$$\frac{4!\,12!}{3!\,3!\,3!\,3!}$$

different ways. The probability of this event is 

$$\frac{\dfrac{4!\,12!}{3!\,3!\,3!\,3!}}{\dfrac{16!}{4!\,4!\,4!\,4!}}.$$

After some cancellations, we find that this is equal to 

$$\frac{12\cdot 8\cdot 4}{15\cdot 14\cdot 13},$$


consistent with the answer obtained in Example 1.11. 
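
The cancellations can be double-checked with exact rational arithmetic. The sketch below is an illustration under the example's assumptions (all partitions equally likely); the helper `multinomial` is a name introduced here, not one used in the text.

```python
from fractions import Fraction
from math import factorial


def multinomial(sizes) -> int:
    """Evaluate n!/(n_1! ... n_r!) for the given group sizes."""
    denominator = 1
    for size in sizes:
        denominator *= factorial(size)
    return factorial(sum(sizes)) // denominator


# Size of the sample space: partitions of 16 students into four groups of 4.
sample_space = multinomial([4, 4, 4, 4])

# Stage (a): 4! ways to place the graduate students, one per group.
# Stage (b): partitions of the 12 undergraduates into four groups of 3.
favorable = factorial(4) * multinomial([3, 3, 3, 3])

probability = Fraction(favorable, sample_space)

if __name__ == "__main__":
    print(probability, float(probability))
```

Using `Fraction` avoids floating-point rounding, so the result can be compared exactly with the ratio obtained after the cancellations.
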


Here is a summary of all the counting results we have developed. 


Summary of Counting Results 


• Permutations of n objects: $n!$. 
• k-permutations of n objects: $n!/(n - k)!$. 
• Combinations of k out of n objects: $\binom{n}{k} = \dfrac{n!}{k!\,(n-k)!}$. 
• Partitions of n objects into r groups, with the ith group having $n_i$ 
objects: 

$$\binom{n}{n_1, n_2, \ldots, n_r} = \frac{n!}{n_1!\,n_2!\cdots n_r!}.$$





SUMMARY AND DISCUSSION 


A probability problem can usually be broken down into a few basic steps: 


(a) The description of the sample space, that is, the set of possible outcomes 
of a given experiment. 


(b) The (possibly indirect) specification of the probability law (the probability 
of each event). 


(c) The calculation of probabilities and conditional probabilities of various 
events of interest. 


The probabilities of events must satisfy the nonnegativity, additivity, and normalization 
axioms. In the important special case where the set of possible outcomes 
is finite, one can just specify the probability of each outcome and obtain 
the probability of any event by adding the probabilities of the elements of the 
event. 

Given a probability law, we are often interested in conditional probabilities, 
which allow us to reason based on partial information about the outcome of 
the experiment. We can view conditional probabilities as probability laws of a 
special type, under which only outcomes contained in the conditioning event can 
have positive conditional probability. Conditional probabilities can be derived 
from the (unconditional) probability law using the definition $P(A \mid B) = P(A \cap B)/P(B)$. 
However, the reverse process is often convenient, that is, first specify 
some conditional probabilities that are natural for the real situation that we wish 
to model, and then use them to derive the (unconditional) probability law. 

We have illustrated through examples three methods for calculating prob- 
abilities: 

(a) The counting method. This method applies to the case where the num- 
ber of possible outcomes is finite, and all outcomes are equally likely. To 
calculate the probability of an event, we count the number of elements of 
the event and divide by the number of elements of the sample space. 


(b) The sequential method. This method applies when the experiment has a 
sequential character, and suitable conditional probabilities are specified or 
calculated along the branches of the corresponding tree (perhaps using the 
counting method). The probabilities of various events are then obtained 
by multiplying conditional probabilities along the corresponding paths of 
the tree, using the multiplication rule. 


(c) The divide-and-conquer method. Here, the probabilities $P(B)$ of various 
events $B$ are obtained from conditional probabilities $P(B \mid A_i)$, where 
the $A_i$ are suitable events that form a partition of the sample space and 
have known probabilities $P(A_i)$. The probabilities $P(B)$ are then obtained 
by using the total probability theorem. 


Finally, we have focused on a few side topics that reinforce our main themes. 
We have discussed the use of Bayes' rule in inference, which is an important 
application context. We have also discussed some basic principles of counting 
and combinatorics, which are helpful in applying the counting method. 
PROBLEMS 





SECTION 1.1. Sets 


Problem 1. Consider rolling a six-sided die. Let A be the set of outcomes where the 
roll is an even number. Let B be the set of outcomes where the roll is greater than 3. 
Calculate and compare the sets on both sides of De Morgan's laws 


$$(A \cup B)^c = A^c \cap B^c, \qquad (A \cap B)^c = A^c \cup B^c.$$


Problem 2. Let A and B be two sets. 

(a) Show that 

$$A^c = (A^c \cap B) \cup (A^c \cap B^c), \qquad B^c = (A \cap B^c) \cup (A^c \cap B^c).$$

(b) Show that 

$$(A \cap B)^c = (A^c \cap B) \cup (A^c \cap B^c) \cup (A \cap B^c).$$


(c) Consider rolling a fair six-sided die. Let A be the set of outcomes where the roll 
is an odd number. Let B be the set of outcomes where the roll is less than 4. 
Calculate the sets on both sides of the equality in part (b), and verify that the 
equality holds. 


Problem 3.* Prove the identity 

$$A \cup \Big(\bigcap_{n=1}^{\infty} B_n\Big) = \bigcap_{n=1}^{\infty} (A \cup B_n).$$

Solution. If $x$ belongs to the set on the left, there are two possibilities. Either $x \in A$, 
in which case $x$ belongs to all of the sets $A \cup B_n$, and therefore belongs to the set on 
the right. Alternatively, $x$ belongs to all of the sets $B_n$, in which case it belongs to all 
of the sets $A \cup B_n$, and therefore again belongs to the set on the right. 

Conversely, if $x$ belongs to the set on the right, then it belongs to $A \cup B_n$ for all 
$n$. If $x$ belongs to $A$, then it belongs to the set on the left. Otherwise, $x$ must belong 
to every set $B_n$, and again belongs to the set on the left. 


Problem 4.* Cantor's diagonalization argument. Show that the unit interval 
[0, 1] is uncountable, i.e., its elements cannot be arranged in a sequence. 


Solution. Any number $x$ in [0, 1] can be represented in terms of its decimal expansion, 
e.g., 1/3 = 0.3333···. Note that most numbers have a unique decimal expansion, 
but there are a few exceptions. For example, 1/2 can be represented as 0.5000··· or 
as 0.49999···. It can be shown that this is the only kind of exception, i.e., decimal 
expansions that end with an infinite string of zeroes or an infinite string of nines. 

Suppose, to obtain a contradiction, that the elements of [0, 1] can be arranged 
in a sequence $x_1, x_2, x_3, \ldots$, so that every element of [0, 1] appears in the sequence. 
Consider the decimal expansion of $x_n$: 

$$x_n = 0.a_n^1 a_n^2 a_n^3 \cdots,$$

where each digit $a_n^i$ belongs to {0, 1, ..., 9}. Consider now a number $y$ constructed as 
follows. The nth digit of $y$ can be 1 or 2, and is chosen so that it is different from the 
nth digit of $x_n$. Note that $y$ has a unique decimal expansion since it does not end with 
an infinite sequence of zeroes or nines. The number $y$ differs from each $x_n$, since it has 
a different nth digit. Therefore, the sequence $x_1, x_2, \ldots$ does not exhaust the elements 
of [0, 1], contrary to what was assumed. The contradiction establishes that the set [0, 1] 
is uncountable. 
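
The diagonal construction can be illustrated on a finite truncation. The sketch below is not a proof and is not part of the original solution: it assumes we are handed the first n digits of n listed numbers (as strings, a representation chosen here for illustration) and builds the first n digits of y by the rule used in the solution.

```python
def diagonal_digits(expansions) -> str:
    """Build the first n digits of y from n digit strings of length >= n.

    The nth digit of y is 1, unless the nth digit of the nth listed
    expansion is 1, in which case it is 2. Either way y differs from
    the nth expansion in position n.
    """
    digits = []
    for n, expansion in enumerate(expansions):
        digits.append("2" if expansion[n] == "1" else "1")
    return "".join(digits)


# Digits after the decimal point of four hypothetical listed numbers.
listed = ["3333", "5000", "1415", "7182"]
y = diagonal_digits(listed)

if __name__ == "__main__":
    print("0." + y)
    for n, expansion in enumerate(listed):
        # y and the nth listed number disagree in the nth digit.
        print(n + 1, expansion[n], y[n])
```

Because y uses only the digits 1 and 2, it never ends in an infinite run of zeroes or nines, which is the property the solution relies on for uniqueness of the expansion.
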


SECTION 1.2. Probabilistic Models 


Problem 5. Out of the students in a class, 60% are geniuses, 70% love chocolate, 
and 40% fall into both categories. Determine the probability that a randomly selected 
student is neither a genius nor a chocolate lover. 


Problem 6. A six-sided die is loaded in a way that each even face is twice as likely 
as each odd face. All even faces are equally likely, as are all odd faces. Construct a 
probabilistic model for a single roll of this die and find the probability that the outcome 
is less than 4. 


Problem 7. A four-sided die is rolled repeatedly, until the first time (if ever) that an 
even number is obtained. What is the sample space for this experiment? 


Problem 8. You enter a special kind of chess tournament, in which you play one game 
with each of three opponents, but you get to choose the order in which you play your 
opponents, knowing the probability of a win against each. You win the tournament if 
you win two games in a row, and you want to maximize the probability of winning. 
Show that it is optimal to play the weakest opponent second, and that the order of 
playing the other two opponents does not matter. 


Problem 9. A partition of the sample space $\Omega$ is a collection of disjoint events 
$S_1, \ldots, S_n$ such that $\Omega = \bigcup_{i=1}^{n} S_i$. 

(a) Show that for any event $A$, we have 

$$P(A) = \sum_{i=1}^{n} P(A \cap S_i).$$

(b) Use part (a) to show that for any events $A$, $B$, and $C$, we have 

$$P(A) = P(A \cap B) + P(A \cap C) + P(A \cap B^c \cap C^c) - P(A \cap B \cap C).$$


Problem 10. Show the formula 


$$P\big((A \cap B^c) \cup (A^c \cap B)\big) = P(A) + P(B) - 2P(A \cap B),$$

which gives the probability that exactly one of the events A and B will occur. [Compare 
with the formula $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, which gives the probability 
that at least one of the events A and B will occur.] 


Problem 11.* Bonferroni's inequality. 

(a) Prove that for any two events $A$ and $B$, we have 

$$P(A \cap B) \ge P(A) + P(B) - 1.$$

(b) Generalize to the case of $n$ events $A_1, A_2, \ldots, A_n$, by showing that 

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) \ge P(A_1) + P(A_2) + \cdots + P(A_n) - (n - 1).$$

Solution. We have $P(A \cup B) = P(A) + P(B) - P(A \cap B)$ and $P(A \cup B) \le 1$, which 
implies part (a). For part (b), we use De Morgan's law to obtain 

$$\begin{aligned}
1 - P(A_1 \cap \cdots \cap A_n) &= P\big((A_1 \cap \cdots \cap A_n)^c\big) \\
&= P(A_1^c \cup \cdots \cup A_n^c) \\
&\le P(A_1^c) + \cdots + P(A_n^c) \\
&= \big(1 - P(A_1)\big) + \cdots + \big(1 - P(A_n)\big) \\
&= n - P(A_1) - \cdots - P(A_n).
\end{aligned}$$
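
Bonferroni's inequality can be spot-checked numerically. The following sketch assumes a uniform probability law on a finite sample space {0, ..., 19} and randomly generated events; these modeling choices and all names are illustrative, not part of the problem.

```python
import random
from fractions import Fraction


def prob(event, omega_size: int) -> Fraction:
    """Probability of an event under the uniform law on omega_size outcomes."""
    return Fraction(len(event), omega_size)


def bonferroni_holds(events, omega_size: int) -> bool:
    """Check P(A_1 ∩ ... ∩ A_n) >= P(A_1) + ... + P(A_n) - (n - 1)."""
    intersection = set(range(omega_size))
    for event in events:
        intersection &= event
    lower_bound = sum(prob(e, omega_size) for e in events) - (len(events) - 1)
    return prob(intersection, omega_size) >= lower_bound


rng = random.Random(0)  # fixed seed so the check is reproducible
omega_size = 20
trials = [
    [set(rng.sample(range(omega_size), rng.randint(10, 20))) for _ in range(4)]
    for _ in range(200)
]
all_hold = all(bonferroni_holds(events, omega_size) for events in trials)

if __name__ == "__main__":
    print(all_hold)
```

A passing check on random cases is only evidence, of course; the proof above is what establishes the inequality in general.
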


Problem 12.* The inclusion-exclusion formula. Show the following generalizations 
of the formula 

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

(a) Let $A$, $B$, and $C$ be events. Then, 

$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C).$$

(b) Let $A_1, A_2, \ldots, A_n$ be events. Let $S_1 = \{i \mid 1 \le i \le n\}$, $S_2 = \{(i_1, i_2) \mid 1 \le i_1 < i_2 \le n\}$, 
and more generally, let $S_m$ be the set of all $m$-tuples $(i_1, \ldots, i_m)$ of 
indices that satisfy $1 \le i_1 < i_2 < \cdots < i_m \le n$. Then, 

$$\begin{aligned}
P\Big(\bigcup_{k=1}^{n} A_k\Big) = {} & \sum_{i \in S_1} P(A_i) - \sum_{(i_1, i_2) \in S_2} P(A_{i_1} \cap A_{i_2}) \\
& + \sum_{(i_1, i_2, i_3) \in S_3} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{n-1} P\Big(\bigcap_{k=1}^{n} A_k\Big).
\end{aligned}$$

Solution. (a) We use the formulas $P(X \cup Y) = P(X) + P(Y) - P(X \cap Y)$ and 
$(A \cup B) \cap C = (A \cap C) \cup (B \cap C)$. We have 

$$\begin{aligned}
P(A \cup B \cup C) &= P(A \cup B) + P(C) - P\big((A \cup B) \cap C\big) \\
&= P(A \cup B) + P(C) - P\big((A \cap C) \cup (B \cap C)\big) \\
&= P(A \cup B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \\
&= P(A) + P(B) - P(A \cap B) + P(C) - P(A \cap C) - P(B \cap C) + P(A \cap B \cap C) \\
&= P(A) + P(B) + P(C) - P(A \cap B) - P(B \cap C) - P(A \cap C) + P(A \cap B \cap C).
\end{aligned}$$

(b) Use induction and verify the main induction step by emulating the derivation of 
part (a). For a different approach, see the problems at the end of Chapter 2. 
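
The general formula in part (b) can be evaluated mechanically on a small example. The sketch below assumes a uniform law on a finite sample space and uses hypothetical function names; it sums the alternating terms over all m-tuples of events and compares the result with the probability of the union computed directly.

```python
from fractions import Fraction
from itertools import combinations


def union_probability_direct(events, omega_size: int) -> Fraction:
    """P(A_1 ∪ ... ∪ A_n) under the uniform law, from the union itself."""
    union = set().union(*events)
    return Fraction(len(union), omega_size)


def union_probability_inclusion_exclusion(events, omega_size: int) -> Fraction:
    """Alternating sum over all m-tuples of events, m = 1, ..., n."""
    total = Fraction(0)
    n = len(events)
    for m in range(1, n + 1):
        sign = (-1) ** (m - 1)
        for subset in combinations(events, m):
            common = set.intersection(*subset)
            total += sign * Fraction(len(common), omega_size)
    return total


# Four illustrative events in the sample space {0, ..., 9}.
events = [{0, 1, 2, 3}, {2, 3, 4}, {3, 4, 5, 6}, {0, 6, 7}]

if __name__ == "__main__":
    print(union_probability_direct(events, 10))
    print(union_probability_inclusion_exclusion(events, 10))
```
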


Problem 13.* Continuity property of probabilities. 


(a) Let $A_1, A_2, \ldots$ be an infinite sequence of events, which is “monotonically increasing,” 
meaning that $A_n \subset A_{n+1}$ for every $n$. Let $A = \bigcup_{n=1}^{\infty} A_n$. Show that 
$P(A) = \lim_{n \to \infty} P(A_n)$. Hint: Express the event $A$ as a union of countably 
many disjoint sets. 


(b) Suppose now that the events are “monotonically decreasing,” i.e., $A_{n+1} \subset A_n$ 
for every $n$. Let $A = \bigcap_{n=1}^{\infty} A_n$. Show that $P(A) = \lim_{n \to \infty} P(A_n)$. Hint: Apply 
the result of part (a) to the complements of the events. 


(c) Consider a probabilistic model whose sample space is the real line. Show that 

$$P\big([0, \infty)\big) = \lim_{n \to \infty} P\big([0, n]\big), \qquad \text{and} \qquad \lim_{n \to \infty} P\big([n, \infty)\big) = 0.$$

Solution. (a) Let $B_1 = A_1$ and, for $n \ge 2$, $B_n = A_n \cap A_{n-1}^c$. The events $B_n$ are 
disjoint, and we have $\bigcup_{k=1}^{n} B_k = A_n$, and $\bigcup_{k=1}^{\infty} B_k = A$. We apply the additivity axiom 
to obtain 

$$P(A) = \sum_{k=1}^{\infty} P(B_k) = \lim_{n \to \infty} \sum_{k=1}^{n} P(B_k) = \lim_{n \to \infty} P\Big(\bigcup_{k=1}^{n} B_k\Big) = \lim_{n \to \infty} P(A_n).$$

(b) Let $C_n = A_n^c$ and $C = A^c$. Since $A_{n+1} \subset A_n$, we obtain $C_n \subset C_{n+1}$, and the events 
$C_n$ are increasing. Furthermore, $C = A^c = \big(\bigcap_{n=1}^{\infty} A_n\big)^c = \bigcup_{n=1}^{\infty} A_n^c = \bigcup_{n=1}^{\infty} C_n$. Using 
the result from part (a) for the sequence $C_n$, we obtain 

$$1 - P(A) = P(A^c) = P(C) = \lim_{n \to \infty} P(C_n) = \lim_{n \to \infty} \big(1 - P(A_n)\big),$$

from which we conclude that $P(A) = \lim_{n \to \infty} P(A_n)$. 


(c) For the first equality, use the result from part (a) with $A_n = [0, n]$ and $A = [0, \infty)$. 
For the second, use the result from part (b) with $A_n = [n, \infty)$ and $A = \bigcap_{n=1}^{\infty} A_n = \emptyset$. 


SECTION 1.3. Conditional Probability 


Problem 14. We roll two fair 6-sided dice. Each one of the 36 possible outcomes is 
assumed to be equally likely. 

(a) Find the probability that doubles are rolled. 


(b) Given that the roll results in a sum of 4 or less, find the conditional probability 
that doubles are rolled. 


(c) Find the probability that at least one die roll is a 6. 

(d) Given that the two dice land on different numbers, find the conditional probability 
that at least one die roll is a 6. 


Problem 15. A coin is tossed twice. Alice claims that the event of two heads is at 
least as likely if we know that the first toss is a head as if we know that at least one 
of the tosses is a head. Is she right? Does it make a difference if the coin is fair or 
unfair? How can we generalize Alice's reasoning? 


Problem 16. We are given three coins: one has heads in both faces, the second has 
tails in both faces, and the third has a head in one face and a tail in the other. We 
choose a coin at random, toss it, and the result is heads. What is the probability that 
the opposite face is tails? 


Problem 17. A batch of one hundred items is inspected by testing four randomly 
selected items. If one of the four is defective, the batch is rejected. What is the 
probability that the batch is accepted if it contains five defectives? 


Problem 18. Let A and B be events. Show that $P(A \cap B \mid B) = P(A \mid B)$, assuming 
that $P(B) > 0$. 


SECTION 1.4. Total Probability Theorem and Bayes’ Rule 


Problem 19. Alice searches for her term paper in her filing cabinet, which has several 
drawers. She knows that she left her term paper in drawer $j$ with probability $p_j > 0$. 
The drawers are so messy that even if she correctly guesses that the term paper is in 
drawer $i$, the probability that she finds it is only $d_i$. Alice searches in a particular 
drawer, say drawer $i$, but the search is unsuccessful. Conditioned on this event, show 
that the probability that her paper is in drawer $j$ is given by 

$$\frac{p_j}{1 - p_i d_i}, \quad \text{if } j \ne i, \qquad\qquad \frac{p_i (1 - d_i)}{1 - p_i d_i}, \quad \text{if } j = i.$$


Problem 20. How an inferior player with a superior strategy can gain an 
advantage. Boris is about to play a two-game chess match with an opponent, and 
wants to find the strategy that maximizes his winning chances. Each game ends with 
either a win by one of the players, or a draw. If the score is tied at the end of the two 
games, the match goes into sudden-death mode, and the players continue to play until 
the first time one of them wins a game (and the match). Boris has two playing styles, 
timid and bold, and he can choose one of the two at will in each game, no matter what 
style he chose in previous games. With timid play, he draws with probability $p_d > 0$, 
and he loses with probability $1 - p_d$. With bold play, he wins with probability $p_w$, and 
he loses with probability $1 - p_w$. Boris will always play bold during sudden death, but 
may switch style between games 1 and 2. 


(a) Find the probability that Boris wins the match for each of the following strategies: 
(i) Play bold in both games 1 and 2. 
(ii) Play timid in both games 1 and 2. 
(iii) Play timid whenever he is ahead in the score, and play bold otherwise. 


(b) Assume that $p_w < 1/2$, so Boris is the worse player, regardless of the playing 
style he adopts. Show that with the strategy in (iii) above, and depending on 
the values of $p_w$ and $p_d$, Boris may have a better than a 50-50 chance to win the 
match. How do you explain this advantage? 


Problem 21. Two players take turns removing a ball from a jar that initially contains 
m white and n black balls. The first player to remove a white ball wins. Develop a 
recursive formula that allows the convenient computation of the probability that the 
starting player wins. 


Problem 22. Each of k jars contains m white and n black balls. A ball is randomly 
chosen from jar 1 and transferred to jar 2, then a ball is randomly chosen from jar 2 
and transferred to jar 3, etc. Finally, a ball is randomly chosen from jar k. Show that 
the probability that the last ball is white is the same as the probability that the first 
ball is white, i.e., it is m/(m + n). 


Problem 23. We have two jars, each initially containing an equal number of balls. 
We perform four successive ball exchanges. In each exchange, we pick simultaneously 
and at random a ball from each jar and move it to the other jar. What is the probability 
that at the end of the four exchanges all the balls will be in the jar where they started? 


Problem 24. The prisoner's dilemma. The release of two out of three prisoners 
has been announced, but their identity is kept secret. One of the prisoners considers 
asking a friendly guard to tell him who is the prisoner other than himself that will be 
released, but hesitates based on the following rationale: at the prisoner's present state 
of knowledge, the probability of being released is 2/3, but after he knows the answer, 
the probability of being released will become 1/2, since there will be two prisoners 
(including himself) whose fate is unknown and exactly one of the two will be released. 
What is wrong with this line of reasoning? 


Problem 25. A two-envelopes puzzle. You are handed two envelopes, and you 
know that each contains a positive integer dollar amount and that the two amounts are 
different. The values of these two amounts are modeled as constants that are unknown. 
Without knowing what the amounts are, you select at random one of the two envelopes, 
and after looking at the amount inside, you may switch envelopes if you wish. A friend 
claims that the following strategy will increase above 1/2 your probability of ending 
up with the envelope with the larger amount: toss a coin repeatedly, let X be equal to 
1/2 plus the number of tosses required to obtain heads for the first time, and switch 
if the amount in the envelope you selected is less than the value of X. Is your friend 
correct? 


Problem 26. The paradox of induction. Consider a statement whose truth is 
unknown. If we see many examples that are compatible with it, we are tempted to 
view the statement as more probable. Such reasoning is often referred to as inductive 
inference (in a philosophical, rather than mathematical sense). Consider now the 
statement that "all cows are white." An equivalent statement is that "everything that 
is not white is not a cow." We then observe several black crows. Our observations are 
clearly compatible with the statement, but do they make the hypothesis "all cows are 
white" more likely? 

To analyze such a situation, we consider a probabilistic model. Let us assume 
that there are two possible states of the world, which we model as complementary 
events: 

$A$ : all cows are white, 

$A^c$ : 50% of all cows are white. 

Let $p$ be the prior probability $P(A)$ that all cows are white. We make an observation 
of a cow or a crow, with probability $q$ and $1 - q$, respectively, independent of whether 
event $A$ occurs or not. Assume that $0 < p < 1$, $0 < q < 1$, and that all crows are black. 

(a) Given the event $B$ = {a black crow was observed}, what is $P(A \mid B)$? 

(b) Given the event $C$ = {a white cow was observed}, what is $P(A \mid C)$? 


Problem 27. Alice and Bob have $2n + 1$ coins, each coin with probability of heads 
equal to 1/2. Bob tosses $n + 1$ coins, while Alice tosses the remaining $n$ coins. Assuming 
independent coin tosses, show that the probability that after all coins have been tossed, 
Bob will have gotten more heads than Alice is 1/2. 


Problem 28.* Conditional version of the total probability theorem. Let 
$C_1, \ldots, C_n$ be disjoint events that form a partition of the state space. Let also $A$ and 
$B$ be events such that $P(B \cap C_i) > 0$ for all $i$. Show that 

$$P(A \mid B) = \sum_{i=1}^{n} P(C_i \mid B)\, P(A \mid B \cap C_i).$$

Solution. We have 

$$P(A \cap B) = \sum_{i=1}^{n} P\big((A \cap B) \cap C_i\big),$$

and by using the multiplication rule, 

$$P\big((A \cap B) \cap C_i\big) = P(B)\, P(C_i \mid B)\, P(A \mid B \cap C_i).$$

Combining these two equations, dividing by $P(B)$, and using the formula 
$P(A \mid B) = P(A \cap B)/P(B)$, we obtain the desired result. 
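
The identity can be checked on a concrete finite model. The sketch below assumes a uniform law on the sample space {0, ..., 11}, so that conditional probabilities reduce to ratios of set sizes; the particular sets and the helper `cond` are illustrative choices, not taken from the problem.

```python
from fractions import Fraction


def cond(event, given) -> Fraction:
    """P(event | given) under a uniform law on a finite sample space."""
    return Fraction(len(event & given), len(given))


# A partition of {0, ..., 11} into three disjoint events C_1, C_2, C_3.
partition = [{0, 1, 2, 3}, {4, 5, 6, 7}, {8, 9, 10, 11}]
A = {1, 2, 4, 8, 9}
B = {0, 1, 2, 4, 5, 8, 9, 10}  # B meets every C_i, so P(B ∩ C_i) > 0

# Left-hand side: P(A | B).
lhs = cond(A, B)

# Right-hand side: sum over i of P(C_i | B) * P(A | B ∩ C_i).
rhs = sum(cond(C, B) * cond(A, B & C) for C in partition)

if __name__ == "__main__":
    print(lhs, rhs)
```
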


Problem 29.* Let $A$ and $B$ be events with $P(A) > 0$ and $P(B) > 0$. We say that 
an event $B$ suggests an event $A$ if $P(A \mid B) > P(A)$, and does not suggest event $A$ if 
$P(A \mid B) < P(A)$. 

(a) Show that $B$ suggests $A$ if and only if $A$ suggests $B$. 


(b) Assume that $P(B^c) > 0$. Show that $B$ suggests $A$ if and only if $B^c$ does not 
suggest $A$. 


(c) We know that a treasure is located in one of two places, with probabilities $\beta$ and 
$1 - \beta$, respectively, where $0 < \beta < 1$. We search the first place and if the treasure 
is there, we find it with probability $p > 0$. Show that the event of not finding the 
treasure in the first place suggests that the treasure is in the second place. 


Solution. (a) We have $P(A \mid B) = P(A \cap B)/P(B)$, so $B$ suggests $A$ if and only if 
$P(A \cap B) > P(A)P(B)$, which is equivalent to $A$ suggesting $B$, by symmetry. 


(b) Since $P(B) + P(B^c) = 1$, we have 

$$P(B)P(A) + P(B^c)P(A) = P(A) = P(B)P(A \mid B) + P(B^c)P(A \mid B^c),$$

which implies that 

$$P(B^c)\big(P(A) - P(A \mid B^c)\big) = P(B)\big(P(A \mid B) - P(A)\big).$$

Thus, $P(A \mid B) > P(A)$ ($B$ suggests $A$) if and only if $P(A) > P(A \mid B^c)$ ($B^c$ does not 
suggest $A$). 


(c) Let $A$ and $B$ be the events 

$A$ = {the treasure is in the second place}, 

$B$ = {we don't find the treasure in the first place}. 

Using the total probability theorem, we have 

$$P(B) = P(A^c)P(B \mid A^c) + P(A)P(B \mid A) = \beta(1 - p) + (1 - \beta),$$

so 

$$P(A \mid B) = \frac{P(A \cap B)}{P(B)} = \frac{1 - \beta}{\beta(1 - p) + (1 - \beta)} = \frac{1 - \beta}{1 - \beta p} > 1 - \beta = P(A).$$

It follows that event $B$ suggests event $A$. 
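
The conclusion of part (c) can be checked numerically over a grid of parameter values. This is a minimal sketch; the function name and the grid are assumptions made for illustration, and `beta` and `p` play the roles of the prior probability of the first place and the detection probability.

```python
from fractions import Fraction


def posterior_second_place(beta: Fraction, p: Fraction) -> Fraction:
    """P(A | B): treasure is in the second place, given an unsuccessful
    search of the first place."""
    prob_b = beta * (1 - p) + (1 - beta)  # total probability theorem
    return (1 - beta) / prob_b


# Check that the posterior exceeds the prior 1 - beta on a grid of values
# with 0 < beta < 1 and 0 < p <= 1.
grid = [Fraction(i, 10) for i in range(1, 10)]
suggests = all(
    posterior_second_place(beta, p) > 1 - beta
    for beta in grid
    for p in grid + [Fraction(1)]
)

if __name__ == "__main__":
    print(posterior_second_place(Fraction(1, 2), Fraction(1, 2)), suggests)
```
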


SECTION 1.5. Independence 


Problem 30. A hunter has two hunting dogs. One day, on the trail of some animal, 
the hunter comes to a place where the road diverges into two paths. He knows that 
each dog, independent of the other, will choose the correct path with probability p. 
The hunter decides to let each dog choose a path, and if they agree, take that one, and 
if they disagree, to randomly pick a path. Is his strategy better than just letting one 
of the two dogs decide on a path? 


Problem 31. Communication through a noisy channel. A source transmits a 
message (a string of symbols) through a noisy communication channel. Each symbol is 
0 or 1 with probability $p$ and $1 - p$, respectively, and is received incorrectly with probability 
$\epsilon_0$ and $\epsilon_1$, respectively (see Fig. 1.18). Errors in different symbol transmissions 
are independent. 


Figure 1.18: Error probabilities in a binary communication channel. 


(a) What is the probability that the kth symbol is received correctly? 
(b) What is the probability that the string of symbols 1011 is received correctly? 


(c) In an effort to improve reliability, each symbol is transmitted three times and 
the received string is decoded by majority rule. In other words, a 0 (or 1) is 
transmitted as 000 (or 111, respectively), and it is decoded at the receiver as a 0 
(or 1) if and only if the received three-symbol string contains at least two 0s (or 
1s, respectively). What is the probability that a 0 is correctly decoded? 


(d) For what values of $\epsilon_0$ is there an improvement in the probability of correct 
decoding of a 0 when the scheme of part (c) is used? 


(e) Suppose that the scheme of part (c) is used. What is the probability that a 
symbol was 0 given that the received string is 101? 


Problem 32. The king's sibling. The king has only one sibling. What is the proba- 
bility that the sibling is male? Assume that every birth results in a boy with probability 
1/2, independent of other births. Be careful to state any additional assumptions you 
have to make in order to arrive at an answer. 


Problem 33. Using a biased coin to make an unbiased decision. Alice and Bob 
want to choose between the opera and the movies by tossing a fair coin. Unfortunately, 
the only available coin is biased (though the bias is not known exactly). How can they 
use the biased coin to make a decision so that either option (opera or the movies) is 
equally likely to be chosen? 


Problem 34. An electrical system consists of identical components, each of which 
is operational with probability p, independent of other components. The components 
are connected in three subsystems, as shown in Fig. 1.19. The system is operational 
if there is a path that starts at point A, ends at point B, and consists of operational 
components. What is the probability of this happening? 








Figure 1.19: A system of identical components that consists of the three
subsystems 1, 2, and 3. The system is operational if there is a path that starts at
point A, ends at point B, and consists of operational components. 


Problem 35. Reliability of a k-out-of-n system. A system consists of n identical
components, each of which is operational with probability p, independent of other
components. The system is operational if at least k out of the n components are 
operational. What is the probability that the system is operational? 


Problem 36. A power utility can supply electricity to a city from n different power
plants. Power plant i fails with probability p_i, independent of the others.


(a) Suppose that any one plant can produce enough electricity to supply the entire
city. What is the probability that the city will experience a black-out?


(b) Suppose that two power plants are necessary to keep the city from a black-out. 
Find the probability that the city will experience a black-out. 


Problem 37. A cellular phone system services a population of n_1 "voice users" (those
who occasionally need a voice connection) and n_2 "data users" (those who occasionally
need a data connection). We estimate that at a given time, each user will need to be
connected to the system with probability p_1 (for voice users) or p_2 (for data users),
independent of other users. The data rate for a voice user is r_1 bits/sec and for a data
user is r_2 bits/sec. The cellular system has a total capacity of c bits/sec. What is the
probability that more users want to use the system than the system can accommodate? 


Problem 38. The problem of points. Telis and Wendy play a round of golf (18
holes) for a $10 stake, and their probabilities of winning on any one hole are p and
1 − p, respectively, independent of their results in other holes. At the end of 10 holes,
with the score 4 to 6 in favor of Wendy, Telis receives an urgent call and has to report
back to work. They decide to split the stake in proportion to their probabilities of
winning had they completed the round, as follows. If p_T and p_W are the conditional
probabilities that Telis and Wendy, respectively, are ahead in the score after 18 holes
given the 4-6 score after 10 holes, then Telis should get a fraction p_T/(p_T + p_W) of the
stake, and Wendy should get the remaining p_W/(p_T + p_W). How much money should
Telis get? Note: This is an example of the so-called problem of points, which played
an important historical role in the development of probability theory. The problem
was posed by Chevalier de Méré in the 17th century to Pascal, who introduced the
idea that the stake of an interrupted game should be divided in proportion to the
players' conditional probabilities of winning given the state of the game at the time of
interruption. Pascal worked out some special cases and, through a correspondence with
Fermat, stimulated much thinking and several probability-related investigations.


Problem 39. A particular class has had a history of low attendance. The annoyed 
professor decides that she will not lecture unless at least k of the n students enrolled 
in the class are present. Each student will independently show up with probability 
p_g if the weather is good, and with probability p_b if the weather is bad. Given the
probability of bad weather on a given day, obtain an expression for the probability that 
the professor will teach her class on that day. 


Problem 40. Consider a coin that comes up heads with probability p and tails with
probability 1 − p. Let q_n be the probability that after n independent tosses, there have
been an even number of heads. Derive a recursion that relates q_n to q_{n−1}, and solve
this recursion to establish the formula

\[ q_n = \frac{1 + (1 - 2p)^n}{2}. \]


Problem 41. Consider a game show with an infinite pool of contestants, where 
at each round i, contestant i obtains a number by spinning a continuously calibrated 
wheel. The contestant with the smallest number thus far survives. Successive wheel 
spins are independent and we assume that there are no ties. Let N be the round at 
which contestant 1 is eliminated. For any positive integer n, find P(N = n). 




Problem 42.* Gambler's ruin. A gambler makes a sequence of independent bets.
In each bet, he wins $1 with probability p, and loses $1 with probability 1 − p. Initially,
the gambler has $k, and plays until he either accumulates $n or has no money left.
What is the probability that the gambler will end up with $n?


Solution. Let us denote by A the event that he ends up with $n, and by F the event
that he wins the first bet. Denote also by w_k the probability of event A, if he starts
with $k. We apply the total probability theorem to obtain

\[ w_k = P(A \mid F)P(F) + P(A \mid F^c)P(F^c) = pP(A \mid F) + qP(A \mid F^c), \qquad 0 < k < n, \]

where q = 1 − p. By the independence of past and future bets, having won the first bet
is the same as if he were just starting now but with $(k+1), so that P(A | F) = w_{k+1}
and similarly P(A | F^c) = w_{k−1}. Thus, we have w_k = p w_{k+1} + q w_{k−1}, which can be
written as

\[ w_{k+1} - w_k = r(w_k - w_{k-1}), \qquad 0 < k < n, \]

where r = q/p. We will solve for w_k in terms of p and q using iteration, and the
boundary values w_0 = 0 and w_n = 1.
We have w_{k+1} − w_k = r^k (w_1 − w_0), and since w_0 = 0,

\[ w_{k+1} = w_k + r^k w_1 = w_{k-1} + r^{k-1} w_1 + r^k w_1 = w_1 + r w_1 + \cdots + r^k w_1. \]

The sum in the right-hand side can be calculated separately for the two cases where
r = 1 (or p = q) and r ≠ 1 (or p ≠ q). We have

\[ w_k = \begin{cases} \dfrac{1 - r^k}{1 - r}\, w_1, & \text{if } p \neq q, \\[1ex] k\, w_1, & \text{if } p = q. \end{cases} \]

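Since w_n = 1, we can solve for w_1 and therefore for w_k:

\[ w_1 = \begin{cases} \dfrac{1 - r}{1 - r^n}, & \text{if } p \neq q, \\[1ex] \dfrac{1}{n}, & \text{if } p = q, \end{cases} \]

so that

\[ w_k = \begin{cases} \dfrac{1 - r^k}{1 - r^n}, & \text{if } p \neq q, \\[1ex] \dfrac{k}{n}, & \text{if } p = q. \end{cases} \]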
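
As a sanity check on the closed form, the following minimal Python sketch evaluates w_k with exact rational arithmetic and confirms that it satisfies the recursion w_k = p w_{k+1} + q w_{k−1} together with the boundary values. The function names are illustrative choices, not part of the text.

```python
from fractions import Fraction

def win_probability(k, n, p):
    """Closed-form probability of reaching $n before $0, starting with $k."""
    p = Fraction(p)
    q = 1 - p
    if p == q:
        return Fraction(k, n)
    r = q / p
    return (1 - r**k) / (1 - r**n)

def satisfies_recursion(n, p):
    """Check w_k = p*w_{k+1} + q*w_{k-1} for 0 < k < n, and w_0 = 0, w_n = 1."""
    p = Fraction(p)
    w = [win_probability(k, n, p) for k in range(n + 1)]
    interior = all(w[k] == p * w[k + 1] + (1 - p) * w[k - 1] for k in range(1, n))
    return interior and w[0] == 0 and w[n] == 1

print(satisfies_recursion(10, Fraction(1, 3)))
print(win_probability(5, 10, Fraction(1, 2)))
```

With n = 2 and k = 1 the gambler reaches $2 exactly when he wins the first bet, so the formula should return p in that case; this is one of the checks a reader can run.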


Problem 43.* Let A and B be independent events. Use the definition of indepen- 
dence to prove the following: 


(a) The events A and B^c are independent.
(b) The events A^c and B^c are independent.

Solution. (a) The event A is the union of the disjoint events A ∩ B^c and A ∩ B. Using
the additivity axiom and the independence of A and B, we obtain

\[ P(A) = P(A \cap B) + P(A \cap B^c) = P(A)P(B) + P(A \cap B^c). \]

It follows that

\[ P(A \cap B^c) = P(A)\bigl(1 - P(B)\bigr) = P(A)P(B^c), \]

so A and B^c are independent.
(b) Apply the result of part (a) twice: first on A and B, then on B^c and A.


Problem 44.* Let A, B, and C be independent events, with P(C) > 0. Prove that
A and B are conditionally independent given C.


Solution. We have

\[ \begin{aligned} P(A \cap B \mid C) &= \frac{P(A \cap B \cap C)}{P(C)} \\ &= \frac{P(A)P(B)P(C)}{P(C)} \\ &= P(A)P(B) \\ &= P(A \mid C)P(B \mid C), \end{aligned} \]

so A and B are conditionally independent given C. In the preceding calculation, the
first equality uses the definition of conditional probabilities; the second uses the
assumed independence; the fourth uses the independence of A from C, and of B from C.


Problem 45.* Assume that the events A_1, A_2, A_3, A_4 are independent and that
P(A_3 ∩ A_4) > 0. Show that

\[ P(A_1 \cup A_2 \mid A_3 \cap A_4) = P(A_1 \cup A_2). \]


Solution. We have

\[ P(A_1 \mid A_3 \cap A_4) = \frac{P(A_1 \cap A_3 \cap A_4)}{P(A_3 \cap A_4)} = \frac{P(A_1)P(A_3)P(A_4)}{P(A_3)P(A_4)} = P(A_1). \]

We similarly obtain P(A_2 | A_3 ∩ A_4) = P(A_2) and P(A_1 ∩ A_2 | A_3 ∩ A_4) = P(A_1 ∩ A_2),
and finally,

\[ \begin{aligned} P(A_1 \cup A_2 \mid A_3 \cap A_4) &= P(A_1 \mid A_3 \cap A_4) + P(A_2 \mid A_3 \cap A_4) - P(A_1 \cap A_2 \mid A_3 \cap A_4) \\ &= P(A_1) + P(A_2) - P(A_1 \cap A_2) \\ &= P(A_1 \cup A_2). \end{aligned} \]


Problem 46.* Laplace’s rule of succession. Consider m + 1 boxes with the kth 
box containing k red balls and m — k white balls, where k ranges from 0 to m. We 
choose a box at random (all boxes are equally likely) and then choose a ball at random 
from that box, n successive times (the ball drawn is replaced each time, and a new ball 
is selected independently). Suppose a red ball was drawn each of the n times. What 
is the probability that if we draw a ball one more time it will be red? Estimate this 
probability for large m. 


Solution. We want to find the conditional probability P(E | R_n), where E is the event
of a red ball drawn at time n + 1, and R_n is the event of a red ball drawn each of the n
preceding times. Intuitively, the consistent draw of a red ball indicates that a box with
a high percentage of red balls was chosen, so we expect that P(E | R_n) is closer to 1
than to 0. In fact, Laplace used this example to calculate the probability that the sun
will rise tomorrow given that it has risen for the preceding 5,000 years. (It is not clear
how serious Laplace was about this calculation, but the story is part of the folklore of
probability theory.)

We have

\[ P(E \mid R_n) = \frac{P(E \cap R_n)}{P(R_n)}, \]

and by using the total probability theorem, we obtain

\[ P(R_n) = \sum_{k=0}^{m} P(k\text{th box chosen}) \left(\frac{k}{m}\right)^{n} = \frac{1}{m+1} \sum_{k=0}^{m} \left(\frac{k}{m}\right)^{n}, \]

\[ P(E \cap R_n) = P(R_{n+1}) = \frac{1}{m+1} \sum_{k=0}^{m} \left(\frac{k}{m}\right)^{n+1}. \]

For large m, we can view P(R_n) as a piecewise constant approximation to an integral:

\[ P(R_n) = \frac{1}{m+1} \sum_{k=0}^{m} \left(\frac{k}{m}\right)^{n} \approx \frac{1}{(m+1)m^n} \int_0^m x^n \, dx = \frac{1}{(m+1)m^n} \cdot \frac{m^{n+1}}{n+1} \approx \frac{1}{n+1}. \]

Similarly,

\[ P(E \cap R_n) = P(R_{n+1}) \approx \frac{1}{n+2}, \]

so that

\[ P(E \mid R_n) \approx \frac{n+1}{n+2}. \]

Thus, for large m, drawing a red ball one more time is almost certain when n is large.
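
To see how quickly the approximation (n + 1)/(n + 2) takes hold, the exact value of P(E | R_n) can be computed from the sums above. The following Python sketch does so with exact fractions; the function names are illustrative, and the choice m = 1000, n = 5 is just an example.

```python
from fractions import Fraction

def prob_all_red(m, n):
    """P(R_n): n red draws in a row, with the box chosen uniformly among m + 1 boxes."""
    return sum(Fraction(k, m) ** n for k in range(m + 1)) / (m + 1)

def prob_next_red(m, n):
    """P(E | R_n) = P(R_{n+1}) / P(R_n)."""
    return prob_all_red(m, n + 1) / prob_all_red(m, n)

m, n = 1000, 5
exact = float(prob_next_red(m, n))
approx = (n + 1) / (n + 2)
print(exact, approx)
```

In the extreme case m = 1 there are only two boxes (all white, all red), so after any red draw the next draw is red with certainty, which the code reproduces.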

Problem 47.* Binomial coefficient formula and the Pascal triangle.


(a) Use the definition of \(\binom{n}{k}\) as the number of distinct n-toss sequences with k
heads, to derive the recursion suggested by the so-called Pascal triangle, given in
Fig. 1.20.


(b) Use the recursion derived in part (a) and induction, to establish the formula

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!}. \]


Solution. (a) Note that n-toss sequences that contain k heads (for 0 < k < n) can be
obtained in two ways:


(1) By starting with an (n − 1)-toss sequence that contains k heads and adding a tail
at the end. There are \(\binom{n-1}{k}\) different sequences of this type.
(2) By starting with an (n − 1)-toss sequence that contains k − 1 heads and adding
a head at the end. There are \(\binom{n-1}{k-1}\) different sequences of this type.


Figure 1.20: Sequential calculation method of the binomial coefficients using the
Pascal triangle. Each term \(\binom{n}{k}\) in the triangular array on the left is computed
and placed in the triangular array on the right by adding its two neighbors in the
row above it (except for the boundary terms with k = 0 or k = n, which are equal
to 1).


Thus,

\[ \binom{n}{k} = \begin{cases} \dbinom{n-1}{k-1} + \dbinom{n-1}{k}, & \text{if } k = 1, 2, \ldots, n-1, \\[1ex] 1, & \text{if } k = 0, n. \end{cases} \]


This is the formula corresponding to the Pascal triangle calculation, given in Fig. 1.20.


(b) We now use the recursion from part (a), to demonstrate the formula

\[ \binom{n}{k} = \frac{n!}{k!\,(n-k)!} \]

by induction on n. Indeed, we have from the definition \(\binom{1}{0} = \binom{1}{1} = 1\), so for n = 1 the
above formula is seen to hold as long as we use the convention 0! = 1. If the formula
holds for each index up to n − 1, we have for k = 1, 2, ..., n − 1,

\[ \begin{aligned} \binom{n}{k} &= \binom{n-1}{k-1} + \binom{n-1}{k} \\ &= \frac{(n-1)!}{(k-1)!\,(n-1-k+1)!} + \frac{(n-1)!}{k!\,(n-1-k)!} \\ &= \frac{k}{n} \cdot \frac{n!}{k!\,(n-k)!} + \frac{n-k}{n} \cdot \frac{n!}{k!\,(n-k)!} \\ &= \frac{n!}{k!\,(n-k)!}, \end{aligned} \]

and the induction is complete.
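
The recursion of part (a) translates directly into a short computation. The sketch below (with the illustrative name `pascal_row`) builds a row of the Pascal triangle using only the recursion and the boundary terms, so that it can be compared with the factorial formula of part (b).

```python
from math import factorial

def pascal_row(n):
    """Row n of the Pascal triangle, built only from the recursion of part (a)."""
    row = [1]  # row 0
    for _ in range(n):
        # Interior terms are sums of the two neighbors in the row above;
        # the boundary terms (k = 0 and k = n) are equal to 1.
        row = [1] + [row[k - 1] + row[k] for k in range(1, len(row))] + [1]
    return row

def formula(n, k):
    """The factorial formula n! / (k! (n - k)!)."""
    return factorial(n) // (factorial(k) * factorial(n - k))

print(pascal_row(4))  # [1, 4, 6, 4, 1]
print(all(pascal_row(n)[k] == formula(n, k) for n in range(12) for k in range(n + 1)))
```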


Problem 48.* The Borel-Cantelli lemma. Consider an infinite sequence of trials.
The probability of success at the ith trial is some positive number p_i. Let N be the
event that there is no success, and let I be the event that there is an infinite number
of successes.


(a) Assume that the trials are independent and that \(\sum_{i=1}^{\infty} p_i = \infty\). Show that
P(N) = 0 and P(I) = 1.


(b) Assume that \(\sum_{i=1}^{\infty} p_i < \infty\). Show that P(I) = 0.


Solution. (a) The event N is a subset of the event that there were no successes in the
first n trials, so that

\[ P(N) \le \prod_{i=1}^{n} (1 - p_i). \]

Taking logarithms,

\[ \log P(N) \le \sum_{i=1}^{n} \log(1 - p_i) \le \sum_{i=1}^{n} (-p_i). \]

Taking the limit as n tends to infinity, we obtain log P(N) = −∞, or P(N) = 0.

Let now L_n be the event that there is a finite number of successes and that the
last success occurs at the nth trial. We use the already established result P(N) = 0,
and apply it to the sequence of trials after trial n, to obtain P(L_n) = 0. The event I^c
(finite number of successes) is the union of the disjoint events L_n, n ≥ 1, and N, so
that

\[ P(I^c) = P(N) + \sum_{n=1}^{\infty} P(L_n) = 0, \]

and P(I) = 1.


(b) Let S_i be the event that the ith trial is a success. Fix some number n and for every
i > n, let F_i be the event that the first success after time n occurs at time i. Note
that F_i ⊂ S_i. Finally, let A_n be the event that there is at least one success after time
n. Note that I ⊂ A_n, because an infinite number of successes implies that there are
successes subsequent to time n. Furthermore, the event A_n is the union of the disjoint
events F_i, i > n. Therefore,

\[ P(I) \le P(A_n) = P\Bigl(\bigcup_{i=n+1}^{\infty} F_i\Bigr) = \sum_{i=n+1}^{\infty} P(F_i) \le \sum_{i=n+1}^{\infty} P(S_i) = \sum_{i=n+1}^{\infty} p_i. \]

We take the limit of both sides as n → ∞. Because of the assumption \(\sum_{i=1}^{\infty} p_i < \infty\),
the right-hand side converges to zero. This implies that P(I) = 0.


SECTION 1.6. Counting 


Problem 49. De Méré's puzzle. A six-sided die is rolled three times independently. 
Which is more likely: a sum of 11 or a sum of 12? (This question was posed by the
French nobleman de Méré to his friend Pascal in the 17th century.) 


Problem 50. The birthday problem. Consider n people who are attending a
party. We assume that every person has an equal probability of being born on any day
during the year, independent of everyone else, and ignore the additional complication
presented by leap years (i.e., assume that nobody is born on February 29). What is
the probability that each person has a distinct birthday?


Problem 51. An urn contains m red and n white balls. 


(a) We draw two balls randomly and simultaneously. Describe the sample space and 
calculate the probability that the selected balls are of different color, by using 
two approaches: a counting approach based on the discrete uniform law, and a 
sequential approach based on the multiplication rule. 


(b) We roll a fair 3-sided die whose faces are labeled 1, 2, 3, and if k comes up, we
remove k balls from the urn at random and put them aside. Describe the sample
space and calculate the probability that all of the balls drawn are red, using a
divide-and-conquer approach and the total probability theorem.


Problem 52. We deal from a well-shuffled 52-card deck. Calculate the probability 
that the 13th card is the first king to be dealt. 


Problem 53. Ninety students, including Joe and Jane, are to be split into three 
classes of equal size, and this is to be done at random. What is the probability that 
Joe and Jane end up in the same class? 


Problem 54. Twenty distinct cars park in the same parking lot every day. Ten of
these cars are US-made, while the other ten are foreign-made. The parking lot has
exactly twenty spaces, all in a row, so the cars park side by side. However, the drivers
have varying schedules, so the position any car might take on a certain day is random.


(a) In how many different ways can the cars line up? 


(b) What is the probability that on a given day, the cars will park in such a way 
that they alternate (no two US-made are adjacent and no two foreign-made are 
adjacent)? 


Problem 55. Eight rooks are placed in distinct squares of an 8 × 8 chessboard, with
all possible placements being equally likely. Find the probability that all the rooks are
safe from one another, i.e., that there is no row or column with more than one rook.


Problem 56. An academic department offers 8 lower level courses: {L_1, L_2, ..., L_8}
and 10 higher level courses: {H_1, H_2, ..., H_10}. A valid curriculum consists of 4 lower
level courses, and 3 higher level courses.


(a) How many different curricula are possible?


(b) Suppose that {H_1, ..., H_5} have L_1 as a prerequisite, and {H_6, ..., H_10} have L_2
and L_3 as prerequisites, i.e., any curricula which involve, say, one of {H_1, ..., H_5}
must also include L_1. How many different curricula are there?


Problem 57. How many 6-word sentences can be made using each of the 26 letters
of the alphabet exactly once? A word is defined as a nonempty (possibly gibberish)
sequence of letters.


Problem 58. We draw the top 7 cards from a well-shuffled standard 52-card deck.
Find the probability that:


(a) The 7 cards include exactly 3 aces. 
(b) The 7 cards include exactly 2 kings. 


(c) The 7 cards include exactly 3 aces, or exactly 2 kings, or both.


Problem 59. A parking lot contains 100 cars, k of which happen to be lemons. We 
select m of these cars at random and take them for a test drive. Find the probability 
that n of the cars tested turn out to be lemons. 


Problem 60. A well-shuffled 52-card deck is dealt to 4 players. Find the probability 
that each of the players gets an ace. 


Problem 61.* Hypergeometric probabilities. An urn contains n balls, out of
which m are red. We select k of the balls at random, without replacement (i.e., selected
balls are not put back into the urn before the next selection). What is the probability
that i of the selected balls are red?


Solution. The sample space consists of the \(\binom{n}{k}\) different ways that we can select k out
of the available balls. For the event of interest to occur, we have to select i out of the
m red balls, which can be done in \(\binom{m}{i}\) ways, and also select k − i out of the n − m balls
that are not red, which can be done in \(\binom{n-m}{k-i}\) ways. Therefore, the desired probability
is

\[ \frac{\dbinom{m}{i} \dbinom{n-m}{k-i}}{\dbinom{n}{k}}, \]

for i ≥ 0 satisfying i ≤ m, i ≤ k, and k − i ≤ n − m. For all other i, the probability is
zero.
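
The formula is easy to evaluate numerically. The following Python sketch (the function name and argument order are illustrative choices) computes the hypergeometric probabilities exactly and can be used to check that they add up to 1 over all i.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(i, n, m, k):
    """P(i red among k balls drawn without replacement from n balls, m of them red)."""
    if i < 0 or i > m or i > k or k - i > n - m:
        return Fraction(0)
    return Fraction(comb(m, i) * comb(n - m, k - i), comb(n, k))

# Example: n = 10 balls, m = 4 red, k = 3 selected.
probabilities = [hypergeometric_pmf(i, 10, 4, 3) for i in range(4)]
print(probabilities, sum(probabilities))
```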


Problem 62.* Correcting the number of permutations for indistinguishable 
objects. When permuting n objects, some of which are indistinguishable, different 
permutations may lead to indistinguishable object sequences, so the number of distin- 
guishable object sequences is less than n!. For example, there are six permutations of 
the letters A, B, and C: 


ABC, ACB, BAC, BCA, CAB, CBA,


but only three distinguishable sequences that can be formed using the letters A, D,
and D:

ADD, DAD, DDA.


(a) Suppose that k out of the n objects are indistinguishable. Show that the number
of distinguishable object sequences is n!/k!.


(b) Suppose that we have r types of indistinguishable objects, and for each i, k_i
objects of type i. Show that the number of distinguishable object sequences is

\[ \frac{n!}{k_1!\, k_2! \cdots k_r!}. \]


Solution. (a) Each one of the n! permutations corresponds to k! duplicates which are
obtained by permuting the k indistinguishable objects. Thus, the n! permutations can
be grouped into n!/k! groups of k! indistinguishable permutations that result in the
same object sequence. Therefore, the number of distinguishable object sequences is
n!/k!. For example, the three letters A, D, and D give the 3! = 6 permutations


ADD, ADD, DAD, DDA, DAD, DDA,


obtained by replacing B and C by D in the permutations of A, B, and C given earlier.
However, these 6 permutations can be divided into the n!/k! = 3!/2! = 3 groups


{ADD, ADD}, {DAD, DAD}, {DDA, DDA},


each having k! = 2! = 2 indistinguishable permutations.
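
The grouping argument can be checked by brute force on small examples. The Python sketch below lists all n! permutations, keeps the distinct sequences, and compares the count with the formula stated in part (b) of the problem (which reduces to n!/k! when only one type is repeated). The function names are illustrative.

```python
from collections import Counter
from itertools import permutations
from math import factorial

def count_by_enumeration(objects):
    """Number of distinguishable sequences, found by listing all n! permutations."""
    return len(set(permutations(objects)))

def count_by_formula(objects):
    """n! / (k_1! k_2! ... k_r!), where k_i is the number of objects of type i."""
    count = factorial(len(objects))
    for k in Counter(objects).values():
        count //= factorial(k)
    return count

print(count_by_enumeration("ADD"), count_by_formula("ADD"))  # 3 3
```

Enumeration is only practical for short sequences, since the number of permutations grows as n!.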


(b) One solution is to extend the argument in (a) above: for each object type i, there are
k_i! indistinguishable permutations of the k_i objects. Hence, each permutation belongs
to a group of k_1! k_2! ··· k_r! indistinguishable permutations, all of which yield the same
object sequence.

An alternative argument goes as follows. Choosing a distinguishable object
sequence is the same as starting with n slots and for each i, choosing the k_i slots to be
occupied by objects of type i. This is the same as partitioning the set {1, ..., n} into
groups of size k_1, ..., k_r, and the number of such partitions is given by the multinomial
coefficient.


2.1 BASIC CONCEPTS


In many probabilistic models, the outcomes are numerical, e.g., when they
correspond to instrument readings or stock prices. In other experiments, the outcomes
are not numerical, but they may be associated with some numerical values of
interest. For example, if the experiment is the selection of students from a given
population, we may wish to consider their grade point average. When dealing
with such numerical values, it is often useful to assign probabilities to them.
This is done through the notion of a random variable, the focus of the present
chapter.

Given an experiment and the corresponding set of possible outcomes (the
sample space), a random variable associates a particular number with each
outcome; see Fig. 2.1. We refer to this number as the numerical value or simply
the value of the random variable. Mathematically, a random variable is a
real-valued function of the experimental outcome.















Figure 2.1: (a) Visualization of a random variable. It is a function that assigns 
a numerical value to each possible outcome of the experiment. (b) An example 
of a random variable. The experiment consists of two rolls of a 4-sided die, and 
the random variable is the maximum of the two rolls. If the outcome of the
experiment is (4, 2), the value of this random variable is 4.


Here are some examples of random variables: 


(a) In an experiment involving a sequence of 5 tosses of a coin, the number of
heads in the sequence is a random variable. However, the 5-long sequence
of heads and tails is not considered a random variable because it does not
have an explicit numerical value.


(b) In an experiment involving two rolls of a die, the following are examples of
random variables: 


(i) The sum of the two rolls. 
(ii) The number of sixes in the two rolls. 
(iii) The second roll raised to the fifth power. 


(c) In an experiment involving the transmission of a message, the time needed
to transmit the message, the number of symbols received in error, and the
delay with which the message is received are all random variables.


There are several basic concepts associated with random variables, which
are summarized below. These concepts will be discussed in detail in the present 
chapter. 


Main Concepts Related to Random Variables 
Starting with a probabilistic model of an experiment: 


• A random variable is a real-valued function of the outcome of the
experiment.


• A function of a random variable defines another random variable.


• We can associate with each random variable certain “averages” of
interest, such as the mean and the variance.


• A random variable can be conditioned on an event or on another
random variable.


• There is a notion of independence of a random variable from an
event or from another random variable.





A random variable is called discrete if its range (the set of values that it
can take) is either finite or countably infinite. For example, the random variables
mentioned in (a) and (b) above can take at most a finite number of numerical
values, and are therefore discrete.

A random variable that can take an uncountably infinite number of values
is not discrete. For an example, consider the experiment of choosing a point
a from the interval [−1, 1]. The random variable that associates the numerical
value a^2 to the outcome a is not discrete. On the other hand, the random variable
that associates with a the numerical value

\[ \operatorname{sgn}(a) = \begin{cases} 1, & \text{if } a > 0, \\ 0, & \text{if } a = 0, \\ -1, & \text{if } a < 0, \end{cases} \]

is discrete.
In this chapter, we focus exclusively on discrete random variables, even
though we will typically omit the qualifier “discrete.”


Concepts Related to Discrete Random Variables 


Starting with a probabilistic model of an experiment: 


• A discrete random variable is a real-valued function of the outcome
of the experiment that can take a finite or countably infinite number
of values.


• A discrete random variable has an associated probability mass
function (PMF), which gives the probability of each numerical value that
the random variable can take.


• A function of a discrete random variable defines another discrete
random variable, whose PMF can be obtained from the PMF of the
original random variable.





We will discuss each of the above concepts and the associated methodology 
in the following sections. In addition, we will provide examples of some important
and frequently encountered random variables. In Chapter 3, we will discuss 
general (not necessarily discrete) random variables. 

Even though this chapter may appear to be covering a lot of new ground, 
this is not really the case. The general line of development is to simply take 
the concepts from Chapter 1 (probabilities, conditioning, independence, etc.) 
and apply them to random variables rather than events, together with some 
convenient new notation. The only genuinely new concepts relate to means and 
variances. 


2.2 PROBABILITY MASS FUNCTIONS


The most important way to characterize a random variable is through the
probabilities of the values that it can take. For a discrete random variable X, these
are captured by the probability mass function (PMF for short) of X, denoted
p_X. In particular, if x is any possible value of X, the probability mass of x,
denoted p_X(x), is the probability of the event {X = x} consisting of all outcomes
that give rise to a value of X equal to x:

\[ p_X(x) = P(\{X = x\}). \]

For example, let the experiment consist of two independent tosses of a fair coin,
and let X be the number of heads obtained. Then the PMF of X is

\[ p_X(x) = \begin{cases} 1/4, & \text{if } x = 0 \text{ or } x = 2, \\ 1/2, & \text{if } x = 1, \\ 0, & \text{otherwise.} \end{cases} \]


In what follows, we will often omit the braces from the event/set notation 
when no ambiguity can arise. In particular, we will usually write P(X = x) in 
place of the more correct notation P({X = x}), and we will write P(X ∈ S)
for the probability that X takes a value within a set S. We will also adhere 
to the following convention throughout: we will use upper case characters 
to denote random variables, and lower case characters to denote real 
numbers such as the numerical values of a random variable. 


Note that

\[ \sum_{x} p_X(x) = 1, \]

where in the summation above, x ranges over all the possible numerical values of
X. This follows from the additivity and normalization axioms: as x ranges over
all possible values of X, the events {X = x} are disjoint and form a partition of
the sample space. By a similar argument, for any set S of possible values of X,
we have

\[ P(X \in S) = \sum_{x \in S} p_X(x). \]

For example, if X is the number of heads obtained in two independent tosses of
a fair coin, as above, the probability of at least one head is

\[ P(X > 0) = \sum_{x=1}^{2} p_X(x) = \frac{1}{2} + \frac{1}{4} = \frac{3}{4}. \]



Calculating the PMF of X is conceptually straightforward, and is illus- 
trated in Fig. 2.2. 


Calculation of the PMF of a Random Variable X 


For each possible value x of X: 


1. Collect all the possible outcomes that give rise to the event {X = x}.
2. Add their probabilities to obtain px (x). 
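
The two steps above can be sketched in a few lines of Python. The example assumes the experiment of Fig. 2.2(b), two independent rolls of a fair 4-sided die with X equal to the maximum roll; the helper name `pmf` and its arguments are illustrative.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

def pmf(outcomes, probability, random_variable):
    """For each value x, add the probabilities of all outcomes with X = x."""
    p = defaultdict(Fraction)
    for outcome in outcomes:
        p[random_variable(outcome)] += probability(outcome)
    return dict(p)

# Two independent rolls of a fair 4-sided die: 16 equally likely outcomes.
rolls = list(product(range(1, 5), repeat=2))
p_max = pmf(rolls, lambda outcome: Fraction(1, 16), max)
print(p_max[2])  # 3/16
```

The same helper works for any finite sample space, as long as the probabilities of the individual outcomes are supplied.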





Figure 2.2: (a) Illustration of the method to calculate the PMF of a random
variable X. For each possible value x, we collect all the outcomes that give rise
to X = x and add their probabilities to obtain p_X(x). (b) Calculation of the
PMF p_X of the random variable X = maximum roll in two independent rolls
of a fair 4-sided die. There are four possible values x, namely, 1, 2, 3, 4. To
calculate p_X(x) for a given x, we add the probabilities of the outcomes that give
rise to x. For example, there are three outcomes that give rise to x = 2, namely,
(1, 2), (2, 2), (2, 1). Each of these outcomes has probability 1/16, so p_X(2) = 3/16,
as indicated in the figure.


The Bernoulli Random Variable


Consider the toss of a coin, which comes up a head with probability p, and a tail
with probability 1 − p. The Bernoulli random variable takes the two values 1
and 0, depending on whether the outcome is a head or a tail:

\[ X = \begin{cases} 1, & \text{if a head,} \\ 0, & \text{if a tail.} \end{cases} \]

Its PMF is

\[ p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0. \end{cases} \]

For all its simplicity, the Bernoulli random variable is very important. In
practice, it is used to model generic probabilistic situations with just two
outcomes, such as:


(a) The state of a telephone at a given time that can be either free or busy.
(b) A person who can be either healthy or sick with a certain disease.


(c) The preference of a person who can be either for or against a certain
political candidate.


Furthermore, by combining multiple Bernoulli random variables, one can
construct more complicated random variables, such as the binomial random variable,
which is discussed next.


[Figure 2.3 panels, each plotting p_X(k) against k: binomial PMF with n = 9, p = 1/2; binomial PMF with n large, p small.]

Figure 2.3: The PMF of a binomial random variable. If p = 1/2, the PMF is
symmetric around n/2. Otherwise, the PMF is skewed towards 0 if p < 1/2, and
towards n if p > 1/2.


The Binomial Random Variable 


A coin is tossed n times. At each toss, the coin comes up a head with probability
p, and a tail with probability 1 − p, independent of prior tosses. Let X be the
number of heads in the n-toss sequence. We refer to X as a binomial random
variable with parameters n and p. The PMF of X consists of the binomial
probabilities that were calculated in Section 1.5:

\[ p_X(k) = P(X = k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n. \]

(Note that here and elsewhere, we simplify notation and use k, instead of x,
to denote the values of integer-valued random variables.) The normalization
property, specialized to the binomial random variable, is written as

\[ \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = 1. \]

Some special cases of the binomial PMF are sketched in Fig. 2.3.
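
As a quick numerical illustration, the binomial PMF and its normalization property can be evaluated as follows. This is a minimal sketch; the function name is an illustrative choice, and the parameters n = 9, p = 1/2 match the first panel of Fig. 2.3.

```python
from math import comb

def binomial_pmf(k, n, p):
    """p_X(k) = C(n, k) p^k (1 - p)^(n - k) for k = 0, 1, ..., n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 9, 0.5
pmf_values = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf_values)  # normalization: should equal 1 up to rounding
print(pmf_values)
print(total)
```

For p = 1/2 the list of values reads the same forwards and backwards, reflecting the symmetry around n/2 noted in the caption of Fig. 2.3.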


The Geometric Random Variable 


Suppose that we repeatedly and independently toss a coin with probability of
a head equal to p, where 0 < p < 1. The geometric random variable is the
number X of tosses needed for a head to come up for the first time. Its PMF is
given by

\[ p_X(k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots, \]

since (1 − p)^{k−1} p is the probability of the sequence consisting of k − 1 successive
tails followed by a head; see Fig. 2.4. This is a legitimate PMF because

\[ \sum_{k=1}^{\infty} p_X(k) = \sum_{k=1}^{\infty} (1-p)^{k-1} p = p \sum_{k=0}^{\infty} (1-p)^{k} = p \cdot \frac{1}{1 - (1-p)} = 1. \]


Figure 2.4: The PMF

\[ p_X(k) = (1-p)^{k-1} p, \qquad k = 1, 2, \ldots, \]

of a geometric random variable. It decreases as a geometric progression with
parameter 1 − p.


Naturally, the use of coin tosses here is just to provide insight. More 
generally, we can interpret the geometric random variable in terms of repeated 
independent trials until the first "success." Each trial has probability of success p 
and the number of trials until (and including) the first success is modeled by the 
geometric random variable. The meaning of "success" is context-dependent. For 
example, it could mean passing a test in a given try, finding a missing item in a 
given search, or finding the tax help information line free in a given attempt, etc. 
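
The normalization of the geometric PMF can also be observed numerically: the partial sum over k = 1, ..., K equals 1 − (1 − p)^K, which approaches 1 as K grows. The sketch below uses illustrative function names and an arbitrary value p = 0.3.

```python
def geometric_pmf(k, p):
    """p_X(k) = (1 - p)^(k - 1) p for k = 1, 2, ..."""
    return (1 - p) ** (k - 1) * p

def partial_sum(p, terms):
    """Sum of the geometric PMF over k = 1, ..., terms."""
    return sum(geometric_pmf(k, p) for k in range(1, terms + 1))

p = 0.3
for terms in (5, 20, 200):
    print(terms, partial_sum(p, terms), 1 - (1 - p) ** terms)
```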


The Poisson Random Variable 


A Poisson random variable has a PMF given by

\[ p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots, \]

where λ is a positive parameter characterizing the PMF; see Fig. 2.5. This is a
legitimate PMF because

\[ \sum_{k=0}^{\infty} e^{-\lambda} \frac{\lambda^k}{k!} = e^{-\lambda} \left( 1 + \lambda + \frac{\lambda^2}{2!} + \frac{\lambda^3}{3!} + \cdots \right) = e^{-\lambda} e^{\lambda} = 1. \]


Figure 2.5: The PMF e^{−λ} λ^k / k! of a Poisson random variable for different values
of λ. Note that if λ < 1, then the PMF is monotonically decreasing with k, while
if λ > 1, the PMF first increases and then decreases (this is shown in the
end-of-chapter problems).


To get a feel for the Poisson random variable, think of a binomial random
variable with very small p and very large n. For example, let X be the number
of typos in a book with a total of n words. Then X is binomial, but since the
probability p that any one word is misspelled is very small, X can also be
well-modeled with a Poisson PMF (let p be the probability of heads in tossing a coin,
and associate misspelled words with coin tosses that result in heads). There are
many similar examples, such as the number of cars involved in accidents in a
city on a given day.†

More precisely, the Poisson PMF with parameter λ is a good approximation
for a binomial PMF with parameters n and p, i.e.,

$$e^{-\lambda} \frac{\lambda^k}{k!} \approx \frac{n!}{k!\,(n-k)!}\, p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

provided λ = np, n is very large, and p is very small. In this case, using the
Poisson PMF may result in simpler models and calculations. For example, let
n = 100 and p = 0.01. Then the probability of k = 5 successes in n = 100 trials
is calculated using the binomial PMF as

$$\frac{100!}{95!\,5!} \cdot 0.01^5 (1 - 0.01)^{95} = 0.00290.$$

Using the Poisson PMF with λ = np = 100 · 0.01 = 1, this probability is
approximated by

$$e^{-1} \frac{1}{5!} = 0.00306.$$

We provide a formal justification of the Poisson approximation property
in the end-of-chapter problems and also in Chapter 6, where we will further
interpret it, extend it, and use it in the context of the Poisson process.
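
The numerical comparison above can be reproduced with a few lines of code. This is only a sketch: the helper names `binomial_pmf` and `poisson_pmf` are ours, and the values n = 100, p = 0.01, k = 5 are those of the example.

```python
from math import comb, exp, factorial


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k: int, lam: float) -> float:
    """Poisson PMF with parameter lam, evaluated at k."""
    return exp(-lam) * lam**k / factorial(k)


n, p, k = 100, 0.01, 5
exact = binomial_pmf(k, n, p)      # binomial value
approx = poisson_pmf(k, n * p)     # Poisson approximation with lam = n * p
print(exact, approx)
```

Trying larger n with np held fixed should bring the two values closer together, in line with the approximation property.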


† The first experimental verification of the connection between the binomial and
the Poisson random variables reputedly occurred in the late 19th century, by matching
the Poisson PMF to the number of horse kick accidents in the Polish cavalry over a
period of several years.


2.3 FUNCTIONS OF RANDOM VARIABLES


Given a random variable X, one may generate other random variables by
applying various transformations on X. As an example, let the random variable
X be today's temperature in degrees Celsius, and consider the transformation
Y = 1.8X + 32, which gives the temperature in degrees Fahrenheit. In this
example, Y is a linear function of X, of the form

$$Y = g(X) = aX + b,$$

where a and b are scalars. We may also consider nonlinear functions of the
general form

$$Y = g(X).$$

For example, if we wish to display temperatures on a logarithmic scale, we would
want to use the function g(X) = log X.

If Y = g(X) is a function of a random variable X, then Y is also a random
variable, since it provides a numerical value for each possible outcome. This is
because every outcome in the sample space defines a numerical value x for X
and hence also the numerical value y = g(x) for Y. If X is discrete with PMF
$p_X$, then Y is also discrete, and its PMF $p_Y$ can be calculated using the PMF
of X. In particular, to obtain $p_Y(y)$ for any y, we add the probabilities of all
values of x such that g(x) = y:

$$p_Y(y) = \sum_{\{x \mid g(x) = y\}} p_X(x).$$


Example 2.1. Let Y = |X| and let us apply the preceding formula for the PMF
$p_Y$ to the case where

$$p_X(x) = \begin{cases} 1/9, & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0, & \text{otherwise;} \end{cases}$$

see Fig. 2.6 for an illustration. The possible values of Y are y = 0, 1, 2, 3, 4. To
compute $p_Y(y)$ for some given value y from this range, we must add $p_X(x)$ over
all values x such that |x| = y. In particular, there is only one value of X that
corresponds to y = 0, namely x = 0. Thus,

$$p_Y(0) = p_X(0) = \frac{1}{9}.$$

Also, there are two values of X that correspond to each y = 1, 2, 3, 4, so for example,

$$p_Y(1) = p_X(-1) + p_X(1) = \frac{2}{9}.$$

Thus, the PMF of Y is

$$p_Y(y) = \begin{cases} 2/9, & \text{if } y = 1, 2, 3, 4, \\ 1/9, & \text{if } y = 0, \\ 0, & \text{otherwise.} \end{cases}$$

Figure 2.6: The PMFs of X and Y = |X| in Example 2.1.


For another related example, let $Z = X^2$. To obtain the PMF of Z, we
can view it either as the square of the random variable X or as the square of the
random variable Y = |X|. By applying the formula $p_Z(z) = \sum_{\{x \mid x^2 = z\}} p_X(x)$ or
the formula $p_Z(z) = \sum_{\{y \mid y^2 = z\}} p_Y(y)$, we obtain

$$p_Z(z) = \begin{cases} 2/9, & \text{if } z = 1, 4, 9, 16, \\ 1/9, & \text{if } z = 0, \\ 0, & \text{otherwise.} \end{cases}$$
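
The rule "add the probabilities of all x with g(x) = y" translates directly into code. The following sketch, with a helper name `pmf_of_function` of our own choosing and exact arithmetic via `Fraction`, recomputes the PMFs of Y = |X| and Z = X² for the PMF of Example 2.1.

```python
from collections import defaultdict
from fractions import Fraction


def pmf_of_function(pmf, g):
    """PMF of g(X): for each x, add p_X(x) to the mass of y = g(x)."""
    out = defaultdict(Fraction)  # Fraction() is 0
    for x, prob in pmf.items():
        out[g(x)] += prob
    return dict(out)


# PMF of Example 2.1: uniform over the integers in [-4, 4].
p_X = {x: Fraction(1, 9) for x in range(-4, 5)}
p_Y = pmf_of_function(p_X, abs)                 # Y = |X|
p_Z = pmf_of_function(p_X, lambda x: x * x)     # Z = X^2, computed from X
p_Z_via_Y = pmf_of_function(p_Y, lambda y: y * y)  # Z = Y^2, computed from Y
```

Both routes to the PMF of Z give the same dictionary, as the text asserts.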


2.4 EXPECTATION, MEAN, AND VARIANCE 


The PMF of a random variable X provides us with several numbers, the
probabilities of all the possible values of X. It is often desirable, however, to summarize
this information in a single representative number. This is accomplished by the
expectation of X, which is a weighted (in proportion to probabilities) average
of the possible values of X.

As motivation, suppose you spin a wheel of fortune many times. At each
spin, one of the numbers $m_1, m_2, \ldots, m_n$ comes up with corresponding
probability $p_1, p_2, \ldots, p_n$, and this is your monetary reward from that spin. What is
the amount of money that you "expect" to get "per spin"? The terms "expect"
and "per spin" are a little ambiguous, but here is a reasonable interpretation.

Suppose that you spin the wheel k times, and that $k_i$ is the number of times
that the outcome is $m_i$. Then, the total amount received is
$m_1 k_1 + m_2 k_2 + \cdots + m_n k_n$. The amount received per spin is

$$M = \frac{m_1 k_1 + m_2 k_2 + \cdots + m_n k_n}{k}.$$

If the number of spins k is very large, and if we are willing to interpret
probabilities as relative frequencies, it is reasonable to anticipate that $m_i$ comes up a
fraction of times that is roughly equal to $p_i$:

$$\frac{k_i}{k} \approx p_i, \qquad i = 1, \ldots, n.$$

Thus, the amount of money per spin that you "expect" to receive is

$$M = \frac{m_1 k_1 + m_2 k_2 + \cdots + m_n k_n}{k} \approx m_1 p_1 + m_2 p_2 + \cdots + m_n p_n.$$

Motivated by this example, we introduce the following definition.†


Expectation 


We define the expected value (also called the expectation or the mean) 
of a random variable X, with PMF px, by 


$$\mathbf{E}[X] = \sum_{x} x\, p_X(x).$$





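Example 2.2. Consider two independent coin tosses, each with a 3/4 probability
of a head, and let X be the number of heads obtained. This is a binomial random
variable with parameters n = 2 and p = 3/4. Its PMF is

$$p_X(k) = \begin{cases} (1/4)^2, & \text{if } k = 0, \\ 2 \cdot (1/4) \cdot (3/4), & \text{if } k = 1, \\ (3/4)^2, & \text{if } k = 2, \end{cases}$$

so the mean is

$$\mathbf{E}[X] = 0 \cdot \left(\frac{1}{4}\right)^2 + 1 \cdot \left(2 \cdot \frac{1}{4} \cdot \frac{3}{4}\right) + 2 \cdot \left(\frac{3}{4}\right)^2 = \frac{24}{16} = \frac{3}{2}.$$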

† When dealing with random variables that take a countably infinite number
of values, one has to deal with the possibility that the infinite sum $\sum_x x\, p_X(x)$ is
not well-defined. More concretely, we will say that the expectation is well-defined if
$\sum_x |x|\, p_X(x) < \infty$. In this case, it is known that the infinite sum $\sum_x x\, p_X(x)$ converges
to a finite value that is independent of the order in which the various terms are summed.

For an example where the expectation is not well-defined, consider a random
variable X that takes the value $2^k$ with probability $2^{-k}$, for k = 1, 2, .... For a more
subtle example, consider a random variable X that takes the values $2^k$ and $-2^k$ with
probability $2^{-k}$, for k = 2, 3, .... The expectation is again undefined, even though the
PMF is symmetric around zero and one might be tempted to say that E[X] is zero.

Throughout this book, in the absence of an indication to the contrary, we
implicitly assume that the expected value of the random variables of interest is well-defined.

It is useful to view the mean of X as a "representative" value of X, which
lies somewhere in the middle of its range. We can make this statement more
precise, by viewing the mean as the center of gravity of the PMF, in the sense
explained in Fig. 2.7. In particular, if the PMF is symmetric around a certain
point, that point must be equal to the mean.


Figure 2.7: Interpretation of the mean as a center of gravity. Given a bar with
a weight $p_X(x)$ placed at each point x with $p_X(x) > 0$, the center of gravity c is
the point at which the sum of the torques from the weights to its left is equal to
the sum of the torques from the weights to its right:

$$\sum_{x} (x - c)\, p_X(x) = 0.$$

Thus, $c = \sum_x x\, p_X(x)$, i.e., the center of gravity is equal to the mean E[X].


Variance, Moments, and the Expected Value Rule 


Besides the mean, there are several other quantities that we can associate with
a random variable and its PMF. For example, we define the 2nd moment
of the random variable X as the expected value of the random variable $X^2$.
More generally, we define the nth moment as $\mathbf{E}[X^n]$, the expected value of the
random variable $X^n$. With this terminology, the 1st moment of X is just the
mean.

The most important quantity associated with a random variable X, other
than the mean, is its variance, which is denoted by var(X) and is defined as
the expected value of the random variable $(X - \mathbf{E}[X])^2$, i.e.,

$$\mathrm{var}(X) = \mathbf{E}\big[(X - \mathbf{E}[X])^2\big].$$

Since $(X - \mathbf{E}[X])^2$ can only take nonnegative values, the variance is always
nonnegative.

The variance provides a measure of dispersion of X around its mean.
Another measure of dispersion is the standard deviation of X, which is defined
as the square root of the variance and is denoted by $\sigma_X$:

$$\sigma_X = \sqrt{\mathrm{var}(X)}.$$

The standard deviation is often easier to interpret because it has the same units
as X. For example, if X measures length in meters, the units of variance are
square meters, while the units of the standard deviation are meters.

One way to calculate var(X) is to use the definition of expected value,
after calculating the PMF of the random variable $(X - \mathbf{E}[X])^2$. This latter
random variable is a function of X, and its PMF can be obtained in the manner
discussed in the preceding section.


Example 2.3. Consider the random variable X of Example 2.1, which has the
PMF

$$p_X(x) = \begin{cases} 1/9, & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0, & \text{otherwise.} \end{cases}$$

The mean E[X] is equal to 0. This can be seen from the symmetry of the PMF of
X around 0, and can also be verified from the definition:

$$\mathbf{E}[X] = \sum_{x} x\, p_X(x) = \frac{1}{9} \sum_{x=-4}^{4} x = 0.$$

Let $Z = (X - \mathbf{E}[X])^2 = X^2$. As in Example 2.1, we have

$$p_Z(z) = \begin{cases} 2/9, & \text{if } z = 1, 4, 9, 16, \\ 1/9, & \text{if } z = 0, \\ 0, & \text{otherwise.} \end{cases}$$

The variance of X is then obtained by

$$\mathrm{var}(X) = \mathbf{E}[Z] = \sum_{z} z\, p_Z(z) = 0 \cdot \frac{1}{9} + 1 \cdot \frac{2}{9} + 4 \cdot \frac{2}{9} + 9 \cdot \frac{2}{9} + 16 \cdot \frac{2}{9} = \frac{60}{9}.$$

It turns out that there is an easier method to calculate var(X), which uses
the PMF of X but does not require the PMF of $(X - \mathbf{E}[X])^2$. This method
is based on the following rule.


Expected Value Rule for Functions of Random Variables 


Let X be a random variable with PMF $p_X$, and let g(X) be a function of
X. Then, the expected value of the random variable g(X) is given by


$$\mathbf{E}[g(X)] = \sum_{x} g(x)\, p_X(x).$$





To verify this rule, we let Y = g(X) and use the formula

$$p_Y(y) = \sum_{\{x \mid g(x) = y\}} p_X(x)$$

derived in the preceding section. We have

$$\begin{aligned}
\mathbf{E}[g(X)] &= \mathbf{E}[Y] \\
&= \sum_{y} y\, p_Y(y) \\
&= \sum_{y} y \sum_{\{x \mid g(x) = y\}} p_X(x) \\
&= \sum_{y} \sum_{\{x \mid g(x) = y\}} y\, p_X(x) \\
&= \sum_{y} \sum_{\{x \mid g(x) = y\}} g(x)\, p_X(x) \\
&= \sum_{x} g(x)\, p_X(x).
\end{aligned}$$


Using the expected value rule, we can write the variance of X as

$$\mathrm{var}(X) = \mathbf{E}\big[(X - \mathbf{E}[X])^2\big] = \sum_{x} (x - \mathbf{E}[X])^2\, p_X(x).$$

Similarly, the nth moment is given by

$$\mathbf{E}[X^n] = \sum_{x} x^n\, p_X(x),$$

and there is no need to calculate the PMF of $X^n$.


Example 2.3 (continued). For the random variable X with PMF

$$p_X(x) = \begin{cases} 1/9, & \text{if } x \text{ is an integer in the range } [-4, 4], \\ 0, & \text{otherwise,} \end{cases}$$

we have

$$\begin{aligned}
\mathrm{var}(X) &= \mathbf{E}\big[(X - \mathbf{E}[X])^2\big] \\
&= \sum_{x} (x - \mathbf{E}[X])^2\, p_X(x) \\
&= \frac{1}{9} \sum_{x=-4}^{4} x^2 \qquad (\text{since } \mathbf{E}[X] = 0) \\
&= \frac{1}{9} (16 + 9 + 4 + 1 + 0 + 1 + 4 + 9 + 16) \\
&= \frac{60}{9},
\end{aligned}$$

which is consistent with the result obtained earlier.

As we have already noted, the variance is always nonnegative, but could it
be zero? Since every term in the formula $\sum_x (x - \mathbf{E}[X])^2 p_X(x)$ for the variance
is nonnegative, the sum is zero if and only if $(x - \mathbf{E}[X])^2 p_X(x) = 0$ for every x.
This condition implies that for any x with $p_X(x) > 0$, we must have x = E[X]
and the random variable X is not really "random": its value is equal to the mean
E[X], with probability 1.
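
The expected value rule is convenient computationally, because it works directly with the PMF of X. The sketch below (the function name `expected_value` and its optional argument `g` are our own conventions) recomputes the mean and the variance of Example 2.3 with exact fractions.

```python
from fractions import Fraction


def expected_value(pmf, g=lambda x: x):
    """Expected value rule: E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * prob for x, prob in pmf.items())


# PMF of Example 2.3: uniform over the integers in [-4, 4].
p_X = {x: Fraction(1, 9) for x in range(-4, 5)}
mean = expected_value(p_X)
variance = expected_value(p_X, lambda x: (x - mean) ** 2)
print(mean, variance)
```

No PMF of $(X - \mathbf{E}[X])^2$ is ever constructed; only the function g changes.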


Variance 


The variance var(X) of a random variable X is defined by


$$\mathrm{var}(X) = \mathbf{E}\big[(X - \mathbf{E}[X])^2\big],$$


and can be calculated as


$$\mathrm{var}(X) = \sum_{x} (x - \mathbf{E}[X])^2\, p_X(x).$$


It is always nonnegative. Its square root is denoted by $\sigma_X$ and is called the
standard deviation.





Properties of Mean and Variance 


We will now use the expected value rule in order to derive some important
properties of the mean and the variance. We start with a random variable X
and define a new random variable Y, of the form

$$Y = aX + b,$$

where a and b are given scalars. Let us derive the mean and the variance of the
linear function Y. We have

$$\mathbf{E}[Y] = \sum_{x} (ax + b)\, p_X(x) = a \sum_{x} x\, p_X(x) + b \sum_{x} p_X(x) = a\mathbf{E}[X] + b.$$

Furthermore,

$$\begin{aligned}
\mathrm{var}(Y) &= \sum_{x} \big(ax + b - \mathbf{E}[aX + b]\big)^2 p_X(x) \\
&= \sum_{x} \big(ax + b - a\mathbf{E}[X] - b\big)^2 p_X(x) \\
&= a^2 \sum_{x} (x - \mathbf{E}[X])^2 p_X(x) \\
&= a^2\, \mathrm{var}(X).
\end{aligned}$$


Mean and Variance of a Linear Function of a Random Variable


Let X be a random variable and let

$$Y = aX + b,$$

where a and b are given scalars. Then,

$$\mathbf{E}[Y] = a\mathbf{E}[X] + b, \qquad \mathrm{var}(Y) = a^2\, \mathrm{var}(X).$$





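Let us also give a convenient alternative formula for the variance of a
random variable X.


Variance in Terms of Moments Expression


$$\mathrm{var}(X) = \mathbf{E}[X^2] - (\mathbf{E}[X])^2.$$


This expression is verified as follows:

$$\begin{aligned}
\mathrm{var}(X) &= \sum_{x} (x - \mathbf{E}[X])^2 p_X(x) \\
&= \sum_{x} \big(x^2 - 2x\mathbf{E}[X] + (\mathbf{E}[X])^2\big) p_X(x) \\
&= \sum_{x} x^2 p_X(x) - 2\mathbf{E}[X] \sum_{x} x\, p_X(x) + (\mathbf{E}[X])^2 \sum_{x} p_X(x) \\
&= \mathbf{E}[X^2] - 2(\mathbf{E}[X])^2 + (\mathbf{E}[X])^2 \\
&= \mathbf{E}[X^2] - (\mathbf{E}[X])^2.
\end{aligned}$$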
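
These identities are easy to check on any small PMF. In the sketch below, the PMF `p_X` and the scalars a = 9/5, b = 32 (the Celsius-to-Fahrenheit map) are arbitrary illustrative choices, and the helper names are ours.

```python
from fractions import Fraction


def expected_value(pmf, g=lambda x: x):
    """Expected value rule: E[g(X)] = sum over x of g(x) * p_X(x)."""
    return sum(g(x) * prob for x, prob in pmf.items())


def variance(pmf):
    """var(X) computed from the definition, via the expected value rule."""
    m = expected_value(pmf)
    return expected_value(pmf, lambda x: (x - m) ** 2)


# Arbitrary illustrative PMF (not from the text).
p_X = {0: Fraction(1, 2), 1: Fraction(1, 3), 5: Fraction(1, 6)}
a, b = Fraction(9, 5), 32
# PMF of Y = aX + b; since a != 0, distinct x map to distinct y.
p_Y = {a * x + b: prob for x, prob in p_X.items()}

mean_identity_holds = expected_value(p_Y) == a * expected_value(p_X) + b
variance_identity_holds = variance(p_Y) == a**2 * variance(p_X)
moment_formula_holds = variance(p_X) == (
    expected_value(p_X, lambda x: x * x) - expected_value(p_X) ** 2
)
```

Because the arithmetic is exact, the three comparisons test equality rather than approximate agreement.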


We finally illustrate by example a common pitfall: unless g(X) is a linear
function, it is not generally true that E[g(X)] is equal to g(E[X]).


Example 2.4. Average Speed Versus Average Time. If the weather is good
(which happens with probability 0.6), Alice walks the 2 miles to class at a speed
of V = 5 miles per hour, and otherwise rides her motorcycle at a speed of V = 30
miles per hour. What is the mean of the time T to get to class?

A correct way to solve the problem is to first derive the PMF of T,

$$p_T(t) = \begin{cases} 0.6, & \text{if } t = 2/5 \text{ hours}, \\ 0.4, & \text{if } t = 2/30 \text{ hours}, \end{cases}$$

and then calculate its mean by

$$\mathbf{E}[T] = 0.6 \cdot \frac{2}{5} + 0.4 \cdot \frac{2}{30} = \frac{4}{15} \text{ hours}.$$

However, it is wrong to calculate the mean of the speed V,

$$\mathbf{E}[V] = 0.6 \cdot 5 + 0.4 \cdot 30 = 15 \text{ miles per hour},$$

and then claim that the mean of the time T is

$$\frac{2}{\mathbf{E}[V]} = \frac{2}{15} \text{ hours}.$$

To summarize, in this example we have

$$T = \frac{2}{V}, \qquad \text{and} \qquad \mathbf{E}[T] = \mathbf{E}\left[\frac{2}{V}\right] \neq \frac{2}{\mathbf{E}[V]}.$$
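
The gap between E[g(X)] and g(E[X]) in Example 2.4 can be seen by computing both quantities side by side. This is a small sketch; the variable names are ours.

```python
from fractions import Fraction

# PMF of the speed V (miles per hour) in Example 2.4.
p_V = {5: Fraction(6, 10), 30: Fraction(4, 10)}
distance = 2  # miles

# Correct: E[T] = E[2 / V], by the expected value rule.
mean_time = sum(Fraction(distance, v) * prob for v, prob in p_V.items())

# Incorrect shortcut: 2 / E[V].
mean_speed = sum(v * prob for v, prob in p_V.items())
naive_time = distance / mean_speed

print(mean_time, naive_time)
```

The correct mean time is twice the value given by the shortcut, so the discrepancy is far from a rounding effect.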


Mean and Variance of Some Common Random Variables 


We will now derive formulas for the mean and the variance of a few important
random variables. These formulas will be used repeatedly in a variety of contexts
throughout the text.


Example 2.5. Mean and Variance of the Bernoulli. Consider the
experiment of tossing a coin, which comes up a head with probability p and a tail with
probability 1 - p, and the Bernoulli random variable X with PMF

$$p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0. \end{cases}$$

The mean, second moment, and variance of X are given by the following
calculations:

$$\begin{aligned}
\mathbf{E}[X] &= 1 \cdot p + 0 \cdot (1 - p) = p, \\
\mathbf{E}[X^2] &= 1^2 \cdot p + 0 \cdot (1 - p) = p, \\
\mathrm{var}(X) &= \mathbf{E}[X^2] - (\mathbf{E}[X])^2 = p - p^2 = p(1 - p).
\end{aligned}$$


Example 2.6. Discrete Uniform Random Variable. What are the mean and
variance associated with a roll of a fair six-sided die? If we view the result of the
roll as a random variable X, its PMF is

$$p_X(k) = \begin{cases} 1/6, & \text{if } k = 1, 2, 3, 4, 5, 6, \\ 0, & \text{otherwise.} \end{cases}$$

Since the PMF is symmetric around 3.5, we conclude that E[X] = 3.5. Regarding
the variance, we have

$$\mathrm{var}(X) = \mathbf{E}[X^2] - (\mathbf{E}[X])^2 = \frac{1}{6}\left(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2\right) - (3.5)^2,$$

which yields var(X) = 35/12.

Figure 2.8: PMF of the discrete random variable that is uniformly
distributed between two integers a and b. Its mean and variance are

$$\mathbf{E}[X] = \frac{a + b}{2}, \qquad \mathrm{var}(X) = \frac{(b - a)(b - a + 2)}{12}.$$

The above random variable is a special case of a discrete uniformly
distributed random variable (or discrete uniform for short), which by definition,
takes one out of a range of contiguous integer values, with equal probability. More
precisely, a discrete uniform random variable has a PMF of the form

$$p_X(k) = \begin{cases} \dfrac{1}{b - a + 1}, & \text{if } k = a, a + 1, \ldots, b, \\ 0, & \text{otherwise,} \end{cases}$$

where a and b are two integers with a < b; see Fig. 2.8.

The mean is

$$\mathbf{E}[X] = \frac{a + b}{2},$$

as can be seen by inspection, since the PMF is symmetric around (a + b)/2. To
calculate the variance of X, we first consider the simpler case where a = 1 and
b = n. It can be verified by induction on n that

$$\mathbf{E}[X^2] = \frac{1}{n} \sum_{k=1}^{n} k^2 = \frac{1}{6}(n + 1)(2n + 1).$$

We leave the verification of this as an exercise for the reader. The variance can now
be obtained in terms of the first and second moments

$$\begin{aligned}
\mathrm{var}(X) &= \mathbf{E}[X^2] - (\mathbf{E}[X])^2 \\
&= \frac{1}{6}(n + 1)(2n + 1) - \frac{1}{4}(n + 1)^2 \\
&= \frac{1}{12}(n + 1)(4n + 2 - 3n - 3) \\
&= \frac{n^2 - 1}{12}.
\end{aligned}$$

For the case of general integers a and b, we note that a random variable which
is uniformly distributed over the interval [a, b] has the same variance as one which
is uniformly distributed over [1, b - a + 1], since the PMF of the second is just a
shifted version of the PMF of the first. Therefore, the desired variance is given by
the above formula with n = b - a + 1, which yields

$$\mathrm{var}(X) = \frac{(b - a + 1)^2 - 1}{12} = \frac{(b - a)(b - a + 2)}{12}.$$
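
The closed-form expressions for the discrete uniform can be checked against a direct summation over the PMF. The following sketch uses helper names of our own and exact fractions.

```python
from fractions import Fraction


def discrete_uniform_pmf(a: int, b: int):
    """PMF that puts equal mass on each of the integers a, a + 1, ..., b."""
    return {k: Fraction(1, b - a + 1) for k in range(a, b + 1)}


def mean_and_variance(pmf):
    """Mean and variance via var(X) = E[X^2] - (E[X])^2."""
    mean = sum(k * prob for k, prob in pmf.items())
    second_moment = sum(k * k * prob for k, prob in pmf.items())
    return mean, second_moment - mean**2


def closed_form(a: int, b: int):
    """The formulas (a + b) / 2 and (b - a)(b - a + 2) / 12 derived in the text."""
    return Fraction(a + b, 2), Fraction((b - a) * (b - a + 2), 12)


die = mean_and_variance(discrete_uniform_pmf(1, 6))
print(die)
```

For the fair die (a = 1, b = 6) this returns the pair 7/2 and 35/12, matching Example 2.6.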


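Example 2.7. The Mean of the Poisson. The mean of the Poisson PMF

$$p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots,$$

can be calculated as follows:

$$\begin{aligned}
\mathbf{E}[X] &= \sum_{k=0}^{\infty} k\, e^{-\lambda} \frac{\lambda^k}{k!} \\
&= \sum_{k=1}^{\infty} k\, e^{-\lambda} \frac{\lambda^k}{k!} \qquad (\text{the } k = 0 \text{ term is zero}) \\
&= \lambda \sum_{k=1}^{\infty} e^{-\lambda} \frac{\lambda^{k-1}}{(k-1)!} \\
&= \lambda \sum_{m=0}^{\infty} e^{-\lambda} \frac{\lambda^m}{m!} \qquad (\text{let } m = k - 1) \\
&= \lambda.
\end{aligned}$$

The last equality is obtained by noting that

$$\sum_{m=0}^{\infty} e^{-\lambda} \frac{\lambda^m}{m!} = \sum_{m=0}^{\infty} p_X(m) = 1$$

is the normalization property for the Poisson PMF.

A similar calculation shows that the variance of a Poisson random variable
is also λ; see Example 2.20 in Section 2.7. We will derive this fact in a number of
different ways in later chapters.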
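
Both facts can be checked numerically by truncating the infinite sums. In the sketch below, the value λ = 2.5 and the truncation point k = 100 are arbitrary choices of ours; the PMF values are generated by the recursion p_X(k) = p_X(k - 1) · λ/k, which follows directly from the formula for the PMF.

```python
from math import exp


def poisson_pmf_values(lam: float, k_max: int):
    """Return [p_X(0), ..., p_X(k_max)] using p_X(k) = p_X(k - 1) * lam / k."""
    values = [exp(-lam)]
    for k in range(1, k_max + 1):
        values.append(values[-1] * lam / k)
    return values


lam = 2.5  # illustrative parameter
pmf = poisson_pmf_values(lam, 100)  # the neglected tail is negligible for this lam
total = sum(pmf)
mean = sum(k * prob for k, prob in enumerate(pmf))
second_moment = sum(k * k * prob for k, prob in enumerate(pmf))
variance = second_moment - mean**2
print(total, mean, variance)
```

Up to floating-point error, the total is 1 and both the mean and the variance equal λ.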


Decision Making Using Expected Values 


Expected values often provide a convenient vehicle for optimizing the choice
between several candidate decisions that result in random rewards. If we view
the expected reward of a decision as its "average payoff over a large number of
trials," it is reasonable to choose a decision with maximum expected reward.
The following is an example.

Example 2.8. The Quiz Problem. This example, when generalized
appropriately, is a prototypical formulation of the problem of optimal sequencing of a
collection of tasks with uncertain rewards.

Consider a quiz game where a person is given two questions and must decide 
which one to answer first. Question 1 will be answered correctly with probability 
0.8, and the person will then receive as prize $100, while question 2 will be answered 
correctly with probability 0.5, and the person will then receive as prize $200. If the 
first question attempted is answered incorrectly, the quiz terminates, i.e., the person 
is not allowed to attempt the second question. If the first question is answered 
correctly, the person is allowed to attempt the second question. Which question 
should be answered first to maximize the expected value of the total prize money 
received? 

The answer is not obvious because there is a tradeoff: attempting first the
more valuable but also more difficult question 2 carries the risk of never getting a
chance to attempt the easier question 1. Let us view the total prize money received
as a random variable X, and calculate the expected value E[X] under the two
possible question orders (cf. Fig. 2.9):

Figure 2.9: Sequential description of the sample space of the quiz problem
for the two cases where we answer question 1 or question 2 first.

(a) Answer question 1 first: Then the PMF of X is (cf. the left side of Fig. 2.9)

$$p_X(0) = 0.2, \qquad p_X(100) = 0.8 \cdot 0.5, \qquad p_X(300) = 0.8 \cdot 0.5,$$

and we have

$$\mathbf{E}[X] = 0.8 \cdot 0.5 \cdot 100 + 0.8 \cdot 0.5 \cdot 300 = \$160.$$

(b) Answer question 2 first: Then the PMF of X is (cf. the right side of Fig. 2.9)

$$p_X(0) = 0.5, \qquad p_X(200) = 0.5 \cdot 0.2, \qquad p_X(300) = 0.5 \cdot 0.8,$$

and we have

$$\mathbf{E}[X] = 0.5 \cdot 0.2 \cdot 200 + 0.5 \cdot 0.8 \cdot 300 = \$140.$$

Thus, it is preferable to attempt the easier question 1 first.

Let us now generalize the analysis. Denote by p_1 and p_2 the probabilities
of correctly answering questions 1 and 2, respectively, and by v_1 and v_2 the
corresponding prizes. If question 1 is answered first, we have

$$\mathbf{E}[X] = p_1 (1 - p_2) v_1 + p_1 p_2 (v_1 + v_2) = p_1 v_1 + p_1 p_2 v_2,$$

while if question 2 is answered first, we have

$$\mathbf{E}[X] = p_2 (1 - p_1) v_2 + p_2 p_1 (v_2 + v_1) = p_2 v_2 + p_2 p_1 v_1.$$

It is thus optimal to answer question 1 first if and only if

$$p_1 v_1 + p_1 p_2 v_2 \geq p_2 v_2 + p_2 p_1 v_1,$$

or equivalently, if

$$\frac{p_1 v_1}{1 - p_1} \geq \frac{p_2 v_2}{1 - p_2}.$$

Therefore, it is optimal to order the questions in decreasing value of the expression
pv/(1 - p), which provides a convenient index of quality for a question with
probability of correct answer p and value v. Interestingly, this rule generalizes to the
case of more than two questions (see the end-of-chapter problems).
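
The index rule can be tried out on the numbers of the quiz problem. The following is a sketch under our own conventions: a question is represented as a pair (p, v), the function names are hypothetical, and p < 1 is assumed so that the index pv/(1 - p) is defined.

```python
from fractions import Fraction


def expected_prize(order):
    """Expected total prize when questions (p, v) are attempted in the given
    order and the quiz stops at the first incorrect answer."""
    total = Fraction(0)
    prob_all_correct_so_far = Fraction(1)
    for p, v in order:
        prob_all_correct_so_far *= p
        total += prob_all_correct_so_far * v
    return total


def index_order(questions):
    """Sort questions by decreasing value of the index p * v / (1 - p)."""
    return sorted(questions, key=lambda q: q[0] * q[1] / (1 - q[0]), reverse=True)


q1 = (Fraction(4, 5), 100)  # question 1: p = 0.8, prize 100
q2 = (Fraction(1, 2), 200)  # question 2: p = 0.5, prize 200
print(expected_prize([q1, q2]), expected_prize([q2, q1]))
print(index_order([q2, q1]))
```

Here the index is 400 for question 1 and 200 for question 2, so the rule places question 1 first, in agreement with the expected values 160 and 140 computed above.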


2.5 JOINT PMFS OF MULTIPLE RANDOM VARIABLES 


Probabilistic models often involve several random variables. For example, in
a medical diagnosis context, the results of several tests may be significant, or
in a networking context, the workloads of several routers may be of interest.
All of these random variables are associated with the same experiment, sample
space, and probability law, and their values may relate in interesting ways. This
motivates us to consider probabilities of events involving simultaneously several
random variables. In this section, we will extend the concepts of PMF and
expectation developed so far to multiple random variables. Later on, we will
also develop notions of conditioning and independence that closely parallel the
ideas discussed in Chapter 1.

Consider two discrete random variables X and Y associated with the same
experiment. The probabilities of the values that X and Y can take are captured
by the joint PMF of X and Y, denoted $p_{X,Y}$. In particular, if (x, y) is a pair
of possible values of X and Y, the probability mass of (x, y) is the probability
of the event {X = x, Y = y}:

$$p_{X,Y}(x, y) = \mathbf{P}(X = x, Y = y).$$

Here and elsewhere, we use the abbreviated notation P(X = x, Y = y) instead
of the more precise notations P({X = x} ∩ {Y = y}) or P(X = x and Y = y).

The joint PMF determines the probability of any event that can be specified
in terms of the random variables X and Y. For example, if A is the set of all
pairs (x, y) that have a certain property, then

$$\mathbf{P}\big((X, Y) \in A\big) = \sum_{(x,y) \in A} p_{X,Y}(x, y).$$

In fact, we can calculate the PMFs of X and Y by using the formulas

$$p_X(x) = \sum_{y} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x} p_{X,Y}(x, y).$$

The formula for $p_X(x)$ can be verified using the calculation

$$\begin{aligned}
p_X(x) &= \mathbf{P}(X = x) \\
&= \sum_{y} \mathbf{P}(X = x, Y = y) \\
&= \sum_{y} p_{X,Y}(x, y),
\end{aligned}$$

where the second equality follows by noting that the event {X = x} is the union
of the disjoint events {X = x, Y = y} as y ranges over all the different values of
Y. The formula for $p_Y(y)$ is verified similarly. We sometimes refer to $p_X$ and
$p_Y$ as the marginal PMFs, to distinguish them from the joint PMF.

We can calculate the marginal PMFs from the joint PMF by using the
tabular method. Here, the joint PMF of X and Y is arranged in a
two-dimensional table, and the marginal PMF of X or Y at a given value is obtained
by adding the table entries along a corresponding column or row, respectively.
This method is illustrated by the following example and Fig. 2.10.


Example 2.9. Consider two random variables, X and Y, described by the joint
PMF shown in Fig. 2.10. The marginal PMFs are calculated by adding the table
entries along the columns (for the marginal PMF of X) and along the rows (for the
marginal PMF of Y), as indicated.

[Figure: joint PMF $p_{X,Y}(x, y)$ in tabular form, with row sums giving the marginal
PMF $p_Y(y)$ and column sums giving the marginal PMF $p_X(x)$.]

Figure 2.10: Illustration of the tabular method for calculating the marginal
PMFs from the joint PMF in Example 2.9. The joint PMF is represented by the
table, where the number in each square (x, y) gives the value of $p_{X,Y}(x, y)$. To
calculate the marginal PMF $p_X(x)$ for a given value of x, we add the numbers in
the column corresponding to x. For example $p_X(2) = 6/20$. Similarly, to calculate
the marginal PMF $p_Y(y)$ for a given value of y, we add the numbers in the row
corresponding to y. For example $p_Y(2) = 7/20$.


Functions of Multiple Random Variables


When there are multiple random variables of interest, it is possible to generate
new random variables by considering functions involving several of these random
variables. In particular, a function Z = g(X, Y) of the random variables X and
Y defines another random variable. Its PMF can be calculated from the joint
PMF $p_{X,Y}$ according to

$$p_Z(z) = \sum_{\{(x,y) \mid g(x,y) = z\}} p_{X,Y}(x, y).$$

Furthermore, the expected value rule for functions naturally extends and takes
the form

$$\mathbf{E}[g(X, Y)] = \sum_{x} \sum_{y} g(x, y)\, p_{X,Y}(x, y).$$

The verification of this is very similar to the earlier case of a function of a single
random variable. In the special case where g is linear and of the form aX + bY + c,
where a, b, and c are given scalars, we have

$$\mathbf{E}[aX + bY + c] = a\mathbf{E}[X] + b\mathbf{E}[Y] + c.$$

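Example 2.9 (continued). Consider again the random variables X and Y whose
joint PMF is given in Fig. 2.10, and a new random variable Z defined by

$$Z = X + 2Y.$$

The PMF of Z can be calculated using the formula

$$p_Z(z) = \sum_{\{(x,y) \mid x + 2y = z\}} p_{X,Y}(x, y),$$

and we have, using the PMF given in Fig. 2.10,

$$p_Z(3) = \frac{1}{20}, \quad p_Z(4) = \frac{1}{20}, \quad p_Z(5) = \frac{2}{20}, \quad p_Z(6) = \frac{2}{20}, \quad p_Z(7) = \frac{4}{20},$$

$$p_Z(8) = \frac{3}{20}, \quad p_Z(9) = \frac{3}{20}, \quad p_Z(10) = \frac{2}{20}, \quad p_Z(11) = \frac{1}{20}, \quad p_Z(12) = \frac{1}{20}.$$

The expected value of Z can be obtained from its PMF:

$$\begin{aligned}
\mathbf{E}[Z] &= \sum_{z} z\, p_Z(z) \\
&= 3 \cdot \frac{1}{20} + 4 \cdot \frac{1}{20} + 5 \cdot \frac{2}{20} + 6 \cdot \frac{2}{20} + 7 \cdot \frac{4}{20} \\
&\quad + 8 \cdot \frac{3}{20} + 9 \cdot \frac{3}{20} + 10 \cdot \frac{2}{20} + 11 \cdot \frac{1}{20} + 12 \cdot \frac{1}{20} \\
&= 7.55.
\end{aligned}$$

Alternatively, we can obtain E[Z] using the formula

$$\mathbf{E}[Z] = \mathbf{E}[X] + 2\mathbf{E}[Y].$$

From the marginal PMFs, given in Fig. 2.10, we have

$$\begin{aligned}
\mathbf{E}[X] &= 1 \cdot \frac{3}{20} + 2 \cdot \frac{6}{20} + 3 \cdot \frac{8}{20} + 4 \cdot \frac{3}{20} = \frac{51}{20}, \\
\mathbf{E}[Y] &= 1 \cdot \frac{3}{20} + 2 \cdot \frac{7}{20} + 3 \cdot \frac{7}{20} + 4 \cdot \frac{3}{20} = \frac{50}{20}, \\
\mathbf{E}[Z] &= \frac{51}{20} + 2 \cdot \frac{50}{20} = 7.55.
\end{aligned}$$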
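
The tabular method and the two ways of computing E[Z] can be sketched in code. The table of Fig. 2.10 is not reproduced in the text, so the joint PMF used below is a hypothetical table that we chose to be consistent with the marginal PMFs and the PMF of Z quoted above; it should not be read as the figure itself. The helper names are also ours.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint PMF in units of 1/20 (an assumption, chosen to match the
# marginals quoted in the text). Keys are y = 1..4; list entries are x = 1..4.
COUNTS = {
    1: [1, 1, 1, 0],
    2: [1, 2, 3, 1],
    3: [1, 2, 3, 1],
    4: [0, 1, 1, 1],
}
joint = {
    (x, y): Fraction(count, 20)
    for y, row in COUNTS.items()
    for x, count in enumerate(row, start=1)
}


def marginal(joint_pmf, axis):
    """Marginal PMF: axis 0 sums over y (gives p_X), axis 1 sums over x (gives p_Y)."""
    out = defaultdict(Fraction)
    for pair, prob in joint_pmf.items():
        out[pair[axis]] += prob
    return dict(out)


def pmf_of_function(joint_pmf, g):
    """PMF of Z = g(X, Y): add the mass of every pair (x, y) with g(x, y) = z."""
    out = defaultdict(Fraction)
    for (x, y), prob in joint_pmf.items():
        if prob > 0:
            out[g(x, y)] += prob
    return dict(out)


def mean(pmf):
    return sum(value * prob for value, prob in pmf.items())


p_X = marginal(joint, 0)
p_Y = marginal(joint, 1)
p_Z = pmf_of_function(joint, lambda x, y: x + 2 * y)
print(mean(p_Z), mean(p_X) + 2 * mean(p_Y))
```

With this table, both computations of E[Z] give 151/20 = 7.55.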


More than Two Random Variables 


The joint PMF of three random variables X, Y, and Z is defined in analogy with
the above as

$$p_{X,Y,Z}(x, y, z) = \mathbf{P}(X = x, Y = y, Z = z),$$

for all possible triplets of numerical values (x, y, z). Corresponding marginal
PMFs are analogously obtained by equations such as

$$p_{X,Y}(x, y) = \sum_{z} p_{X,Y,Z}(x, y, z),$$

and

$$p_X(x) = \sum_{y} \sum_{z} p_{X,Y,Z}(x, y, z).$$

The expected value rule for functions is given by

$$\mathbf{E}[g(X, Y, Z)] = \sum_{x} \sum_{y} \sum_{z} g(x, y, z)\, p_{X,Y,Z}(x, y, z),$$

and if g is linear and has the form aX + bY + cZ + d, then

$$\mathbf{E}[aX + bY + cZ + d] = a\mathbf{E}[X] + b\mathbf{E}[Y] + c\mathbf{E}[Z] + d.$$

Furthermore, there are obvious generalizations of the above to more than three
random variables. For example, for any random variables $X_1, X_2, \ldots, X_n$ and
any scalars $a_1, a_2, \ldots, a_n$, we have

$$\mathbf{E}[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 \mathbf{E}[X_1] + a_2 \mathbf{E}[X_2] + \cdots + a_n \mathbf{E}[X_n].$$


Example 2.10. Mean of the Binomial. Your probability class has 300 students
and each student has probability 1/3 of getting an A, independent of any other
student. What is the mean of X, the number of students that get an A? Let

$$X_i = \begin{cases} 1, & \text{if the } i\text{th student gets an A}, \\ 0, & \text{otherwise.} \end{cases}$$

Thus $X_1, X_2, \ldots, X_n$, with n = 300, are Bernoulli random variables with common
mean p = 1/3. Their sum

$$X = X_1 + X_2 + \cdots + X_n$$

is the number of students that get an A. Since X is the number of "successes" in n
independent trials, it is a binomial random variable with parameters n and p.
Using the linearity of X as a function of the $X_i$, we have

$$\mathbf{E}[X] = \sum_{i=1}^{300} \mathbf{E}[X_i] = \sum_{i=1}^{300} \frac{1}{3} = 300 \cdot \frac{1}{3} = 100.$$

If we repeat this calculation for a general number of students n and probability of
A equal to p, we obtain

$$\mathbf{E}[X] = \sum_{i=1}^{n} \mathbf{E}[X_i] = \sum_{i=1}^{n} p = np.$$

Example 2.11. The Hat Problem. Suppose that n people throw their hats in
a box and then each picks one hat at random. (Each hat can be picked by only
one person, and each assignment of hats to persons is equally likely.) What is the
expected value of X, the number of people that get back their own hat?

For the ith person, we introduce a random variable $X_i$ that takes the value
1 if the person selects his/her own hat, and takes the value 0 otherwise. Since
$\mathbf{P}(X_i = 1) = 1/n$ and $\mathbf{P}(X_i = 0) = 1 - 1/n$, the mean of $X_i$ is

$$\mathbf{E}[X_i] = 1 \cdot \frac{1}{n} + 0 \cdot \left(1 - \frac{1}{n}\right) = \frac{1}{n}.$$

We now have

$$X = X_1 + X_2 + \cdots + X_n,$$

so that

$$\mathbf{E}[X] = \mathbf{E}[X_1] + \mathbf{E}[X_2] + \cdots + \mathbf{E}[X_n] = n \cdot \frac{1}{n} = 1.$$

Summary of Facts About Joint PMFs 
Let X and Y be random variables associated with the same experiment. 


• The joint PMF $p_{X,Y}$ of X and Y is defined by

$$p_{X,Y}(x, y) = \mathbf{P}(X = x, Y = y).$$

• The marginal PMFs of X and Y can be obtained from the joint
PMF, using the formulas

$$p_X(x) = \sum_{y} p_{X,Y}(x, y), \qquad p_Y(y) = \sum_{x} p_{X,Y}(x, y).$$

• A function g(X, Y) of X and Y defines another random variable, and

$$\mathbf{E}[g(X, Y)] = \sum_{x} \sum_{y} g(x, y)\, p_{X,Y}(x, y).$$

If g is linear, of the form aX + bY + c, we have

$$\mathbf{E}[aX + bY + c] = a\mathbf{E}[X] + b\mathbf{E}[Y] + c.$$

• The above have natural extensions to the case where more than two
random variables are involved.





2.6 CONDITIONING 


Similar to our discussion in Chapter 1, conditional probabilities can be used to
capture the information conveyed by various events about the different possible
values of a random variable. We are thus motivated to introduce conditional
PMFs, given the occurrence of a certain event or given the value of another
random variable. In this section, we develop this idea and we discuss the properties
of conditional PMFs. In reality though, there is not much that is new, only an
elaboration of concepts that are familiar from Chapter 1, together with some
new notation.


Conditioning a Random Variable on an Event


The conditional PMF of a random variable X, conditioned on a particular
event A with P(A) > 0, is defined by

$$p_{X|A}(x) = \mathbf{P}(X = x \mid A) = \frac{\mathbf{P}(\{X = x\} \cap A)}{\mathbf{P}(A)}.$$

Note that the events {X = x} ∩ A are disjoint for different values of x, their
union is A, and, therefore,

$$\mathbf{P}(A) = \sum_{x} \mathbf{P}(\{X = x\} \cap A).$$

Combining the above two formulas, we see that

$$\sum_{x} p_{X|A}(x) = 1,$$

so $p_{X|A}$ is a legitimate PMF.

The conditional PMF is calculated similarly to its unconditional counterpart:
to obtain $p_{X|A}(x)$, we add the probabilities of the outcomes that give rise to
X = x and belong to the conditioning event A, and then normalize by dividing
with P(A).


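Example 2.12. Let X be the roll of a fair six-sided die and let A be the event that
the roll is an even number. Then, by applying the preceding formula, we obtain

$$\begin{aligned}
p_{X|A}(k) &= \mathbf{P}(X = k \mid \text{roll is even}) \\
&= \frac{\mathbf{P}(X = k \text{ and } X \text{ is even})}{\mathbf{P}(\text{roll is even})} \\
&= \begin{cases} 1/3, & \text{if } k = 2, 4, 6, \\ 0, & \text{otherwise.} \end{cases}
\end{aligned}$$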


Example 2.13. A student will take a certain test repeatedly, up to a maximum 
of n times, each time with a probability p of passing, independent of the number 
of previous attempts. What is the PMF of the number of attempts, given that the 
student passes the test? 

Let A be the event that the student passes the test (with at most n attempts). 
We introduce the random variable X, which is the number of attempts that would 
be needed if an unlimited number of attempts were allowed. Then, X is a geometric 
random variable with parameter p, and $A = \{X \le n\}$. We have

$$P(A) = \sum_{m=1}^{n} (1-p)^{m-1} p,$$

and

$$p_{X|A}(k) = \begin{cases} \dfrac{(1-p)^{k-1} p}{\sum_{m=1}^{n} (1-p)^{m-1} p}, & \text{if } k = 1, \ldots, n, \\ 0, & \text{otherwise,} \end{cases}$$

as illustrated in Fig. 2.11.


Figure 2.11: Visualization and calculation of the conditional PMF $p_{X|A}(k)$ in
Example 2.13. We start with the PMF of X, we set to zero the PMF values for all
k that do not belong to the conditioning event A, and we normalize the remaining
values by dividing by P(A).
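As a numerical illustration of Example 2.13, the following sketch builds the conditional PMF of the number of attempts given that the student passes. The function name `truncated_geometric_pmf` and the values p = 1/2, n = 3 are assumptions chosen for illustration only.

```python
from fractions import Fraction

def truncated_geometric_pmf(p, n):
    """Conditional PMF of a geometric(p) variable X given A = {X <= n}."""
    # Unconditional geometric PMF values on k = 1, ..., n.
    geometric = {k: (1 - p) ** (k - 1) * p for k in range(1, n + 1)}
    # P(A) is the sum of (1 - p)^(m - 1) p over m = 1, ..., n.
    p_a = sum(geometric.values())
    # Values k > n are implicitly assigned probability zero.
    return {k: q / p_a for k, q in geometric.items()}

attempts_pmf = truncated_geometric_pmf(Fraction(1, 2), 3)
print(attempts_pmf)
```

With these values, P(A) = 7/8 and the conditional probabilities are 4/7, 2/7, and 1/7.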


Figure 2.12 provides a more abstract visualization of the construction of 
the conditional PMF. 





Figure 2.12: Visualization and calculation of the conditional PMF $p_{X|A}(x)$. For
each x, we add the probabilities of the outcomes in the intersection $\{X = x\} \cap A$,
and normalize by dividing by P(A).


Conditioning one Random Variable on Another 
Let X and Y be two random variables associated with the same experiment. If
we know that the value of Y is some particular y [with $p_Y(y) > 0$], this provides
partial knowledge about the value of X. This knowledge is captured by the
conditional PMF $p_{X|Y}$ of X given Y, which is defined by specializing the
definition of $p_{X|A}$ to events A of the form $\{Y = y\}$:

$$p_{X|Y}(x \mid y) = P(X = x \mid Y = y).$$

Using the definition of conditional probabilities, we have

$$p_{X|Y}(x \mid y) = \frac{P(X = x, Y = y)}{P(Y = y)} = \frac{p_{X,Y}(x, y)}{p_Y(y)}.$$

Let us fix some y with $p_Y(y) > 0$, and consider $p_{X|Y}(x \mid y)$ as a function
of x. This function is a valid PMF for X: it assigns nonnegative values to each
possible x, and these values add to 1. Furthermore, this function of x has the
same shape as $p_{X,Y}(x, y)$ except that it is divided by $p_Y(y)$, which enforces the
normalization property

$$\sum_x p_{X|Y}(x \mid y) = 1.$$
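The "slice and renormalize" view of the conditional PMF can be sketched as follows. The joint PMF used here is made up purely for illustration, and the function name `conditional_pmf_given_y` is hypothetical.

```python
from fractions import Fraction

def conditional_pmf_given_y(joint, y):
    """Slice the joint PMF along Y = y and renormalize by p_Y(y)."""
    p_y = sum(p for (_, yy), p in joint.items() if yy == y)
    if p_y == 0:
        raise ValueError("p_Y(y) must be positive")
    return {x: p / p_y for (x, yy), p in joint.items() if yy == y}

# A small, made-up joint PMF with keys (x, y).
joint = {(1, 1): Fraction(1, 10), (2, 1): Fraction(3, 10),
         (1, 2): Fraction(2, 10), (2, 2): Fraction(4, 10)}

x_given_y1 = conditional_pmf_given_y(joint, 1)
print(x_given_y1)
```

Here $p_Y(1) = 4/10$, so the slice values 1/10 and 3/10 become 1/4 and 3/4 after renormalization.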


Figure 2.13 provides a visualization of the conditional PMF. 


[Figure: the joint PMF $p_{X,Y}(x, y)$, together with a slice view of the conditional PMFs $p_{X|Y}(x \mid y)$ for individual values of y.]

Figure 2.13: Visualization of the conditional PMF $p_{X|Y}(x \mid y)$. For each y, we
view the joint PMF along the slice Y = y and renormalize so that

$$\sum_x p_{X|Y}(x \mid y) = 1.$$
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The conditional PMF is often convenient for the calculation of the joint 
PMF, using a sequential approach and the formula 


$$p_{X,Y}(x, y) = p_Y(y)\, p_{X|Y}(x \mid y),$$

or its counterpart

$$p_{X,Y}(x, y) = p_X(x)\, p_{Y|X}(y \mid x).$$

This method is entirely similar to the use of the multiplication rule from Chapter 1.
The following example provides an illustration.


Example 2.14. Professor May B. Right often has her facts wrong, and answers 
each of her students' questions incorrectly with probability 1/4, independent of 
other questions. In each lecture, May is asked 0, 1, or 2 questions with equal 
probability 1/3. Let X and Y be the number of questions May is asked and the 
number of questions she answers wrong in a given lecture, respectively. To construct 
the joint PMF $p_{X,Y}(x, y)$, we need to calculate the probability P(X = x, Y = y)
for all combinations of values of x and y. This can be done by using a sequential
description of the experiment and the multiplication rule, as shown in Fig. 2.14.
For example, for the case where one question is asked and is answered wrong, we
have

$$p_{X,Y}(1, 1) = p_X(x)\, p_{Y|X}(y \mid x) = \frac{1}{3} \cdot \frac{1}{4} = \frac{1}{12}.$$

The joint PMF can be represented by a two-dimensional table, as shown in Fig.
2.14. It can be used to calculate the probability of any event of interest. For
instance, we have

$$P(\text{at least one wrong answer}) = p_{X,Y}(1, 1) + p_{X,Y}(2, 1) + p_{X,Y}(2, 2) = \frac{4}{48} + \frac{6}{48} + \frac{1}{48} = \frac{11}{48}.$$
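The sequential construction in Example 2.14 can be reproduced with a short script. This is a sketch under one modeling step that the example implies but does not spell out as a formula: given X = x questions, the number of wrong answers is binomial with parameters x and 1/4, because answers are wrong independently. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

p_wrong = Fraction(1, 4)
# X: number of questions asked (0, 1, or 2, equally likely).
p_x = {x: Fraction(1, 3) for x in (0, 1, 2)}

def p_y_given_x(y, x):
    """P(Y = y | X = x): y wrong answers among x independently answered questions."""
    return comb(x, y) * p_wrong ** y * (1 - p_wrong) ** (x - y)

# Multiplication rule: p_{X,Y}(x, y) = p_X(x) * p_{Y|X}(y | x).
joint = {(x, y): p_x[x] * p_y_given_x(y, x)
         for x in p_x for y in range(x + 1)}

p_at_least_one_wrong = sum(p for (x, y), p in joint.items() if y >= 1)
print(joint[(1, 1)], p_at_least_one_wrong)
```

The entry for (1, 1) is 1/12 = 4/48, and the probability of at least one wrong answer is 11/48, in agreement with the calculation in the example.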


The conditional PMF can also be used to calculate the marginal PMFs. In
particular, we have, by using the definitions,

$$p_X(x) = \sum_y p_{X,Y}(x, y) = \sum_y p_Y(y)\, p_{X|Y}(x \mid y).$$

This formula provides a divide-and-conquer method for calculating marginal
PMFs. It is in essence identical to the total probability theorem given in Chapter 1,
but cast in different notation. The following example provides an illustration.


Example 2.15. Consider a transmitter that is sending messages over a computer 
network. Let us define the following two random variables: 


X : the travel time of a given message, Y : the length of the given message. 


We know the PMF of the travel time of a message that has a given length, and we
know the PMF of the message length. We want to find the (unconditional) PMF
of the travel time of a message.


Figure 2.14: Calculation of the joint PMF $p_{X,Y}(x, y)$ in Example 2.14. [Figure: a sequential tree description with X, the number of questions asked, and Y, the number of questions answered wrong, together with the joint PMF in tabular form.]


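We assume that the length of a message can take two possible values: $y = 10^2$
bytes with probability 5/6, and $y = 10^4$ bytes with probability 1/6, so that

$$p_Y(y) = \begin{cases} 5/6, & \text{if } y = 10^2, \\ 1/6, & \text{if } y = 10^4. \end{cases}$$

We assume that the travel time X of the message depends on its length Y and the
congestion in the network at the time of transmission. In particular, the travel time
is $10^{-4} Y$ seconds with probability 1/2, $10^{-3} Y$ seconds with probability 1/3, and
$10^{-2} Y$ seconds with probability 1/6. Thus, we have

$$p_{X|Y}(x \mid 10^2) = \begin{cases} 1/2, & \text{if } x = 10^{-2}, \\ 1/3, & \text{if } x = 10^{-1}, \\ 1/6, & \text{if } x = 1, \end{cases} \qquad p_{X|Y}(x \mid 10^4) = \begin{cases} 1/2, & \text{if } x = 1, \\ 1/3, & \text{if } x = 10, \\ 1/6, & \text{if } x = 100. \end{cases}$$

To find the PMF of X, we use the total probability formula

$$p_X(x) = \sum_y p_Y(y)\, p_{X|Y}(x \mid y).$$

We obtain

$$p_X(10^{-2}) = \frac{5}{6} \cdot \frac{1}{2}, \qquad p_X(10^{-1}) = \frac{5}{6} \cdot \frac{1}{3}, \qquad p_X(1) = \frac{5}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{2},$$

$$p_X(10) = \frac{1}{6} \cdot \frac{1}{3}, \qquad p_X(100) = \frac{1}{6} \cdot \frac{1}{6}.$$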
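The total probability calculation of Example 2.15 can be carried out mechanically, as in the following sketch. The dictionary layout and variable names are illustrative choices; the probabilities are the ones stated in the example.

```python
from collections import defaultdict
from fractions import Fraction

# p_Y: message length in bytes.
p_y = {10**2: Fraction(5, 6), 10**4: Fraction(1, 6)}
# The travel time is (factor * Y) seconds with the given probability.
factor_probs = {Fraction(1, 10**4): Fraction(1, 2),
                Fraction(1, 10**3): Fraction(1, 3),
                Fraction(1, 10**2): Fraction(1, 6)}

# p_X(x) = sum over y of p_Y(y) * p_{X|Y}(x | y).
p_x = defaultdict(Fraction)
for y, prob_y in p_y.items():
    for factor, prob_factor in factor_probs.items():
        p_x[factor * y] += prob_y * prob_factor

for x in sorted(p_x):
    print(f"p_X({float(x)}) = {p_x[x]}")
```

The travel time x = 1 second is the only value that can arise from both message lengths, so its probability is the sum of two terms, 5/36 + 1/12 = 2/9.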
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We finally note that one can define conditional PMFs involving more than 
two random variables, such as $p_{X,Y|Z}(x, y \mid z)$ or $p_{X|Y,Z}(x \mid y, z)$. The concepts
and methods described above generalize easily.













Summary of Facts About Conditional PMFs

Let X and Y be random variables associated with the same experiment.

• Conditional PMFs are similar to ordinary PMFs, but pertain to a
universe where the conditioning event is known to have occurred.

• The conditional PMF of X given an event A with P(A) > 0 is defined
by

$$p_{X|A}(x) = P(X = x \mid A)$$

and satisfies

$$\sum_x p_{X|A}(x) = 1.$$

• If $A_1, \ldots, A_n$ are disjoint events that form a partition of the sample
space, with $P(A_i) > 0$ for all i, then

$$p_X(x) = \sum_{i=1}^{n} P(A_i)\, p_{X|A_i}(x).$$

(This is a special case of the total probability theorem.) Furthermore,
for any event B, with $P(A_i \cap B) > 0$ for all i, we have

$$p_{X|B}(x) = \sum_{i=1}^{n} P(A_i \mid B)\, p_{X|A_i \cap B}(x).$$

• The conditional PMF of X given Y = y is related to the joint PMF
by

$$p_{X,Y}(x, y) = p_Y(y)\, p_{X|Y}(x \mid y).$$

• The conditional PMF of X given Y can be used to calculate the
marginal PMF of X through the formula

$$p_X(x) = \sum_y p_Y(y)\, p_{X|Y}(x \mid y).$$

• There are natural extensions of the above involving more than two
random variables.


104 Discrete Random Variables Chap. 2 


Conditional Expectation 


A conditional PMF can be thought of as an ordinary PMF over a new universe
determined by the conditioning event. In the same spirit, a conditional expectation
is the same as an ordinary expectation, except that it refers to the new
universe, and all probabilities and PMFs are replaced by their conditional counterparts.
(Conditional variances can also be treated similarly.) We list the main
definitions and relevant facts below.


Summary of Facts About Conditional Expectations

Let X and Y be random variables associated with the same experiment.

• The conditional expectation of X given an event A with P(A) > 0 is
defined by

$$E[X \mid A] = \sum_x x\, p_{X|A}(x).$$

For a function g(X), we have

$$E[g(X) \mid A] = \sum_x g(x)\, p_{X|A}(x).$$

• The conditional expectation of X given a value y of Y is defined by

$$E[X \mid Y = y] = \sum_x x\, p_{X|Y}(x \mid y).$$

• If $A_1, \ldots, A_n$ are disjoint events that form a partition of the sample
space, with $P(A_i) > 0$ for all i, then

$$E[X] = \sum_{i=1}^{n} P(A_i)\, E[X \mid A_i].$$

Furthermore, for any event B with $P(A_i \cap B) > 0$ for all i, we have

$$E[X \mid B] = \sum_{i=1}^{n} P(A_i \mid B)\, E[X \mid A_i \cap B].$$

• We have

$$E[X] = \sum_y p_Y(y)\, E[X \mid Y = y].$$





The last three equalities above apply in different situations, but are essentially
equivalent, and will be referred to collectively as the total expectation
theorem. They all follow from the total probability theorem, and express the
fact that "the unconditional average can be obtained by averaging the conditional
averages." They can be used to calculate the unconditional expectation
E[X] from the conditional PMF or expectation, using a divide-and-conquer approach.
To verify the first of the three equalities, we write

$$p_X(x) = \sum_{i=1}^{n} P(A_i)\, p_{X|A_i}(x),$$

we multiply both sides by x, and we sum over x:

$$E[X] = \sum_x x\, p_X(x) = \sum_x x \sum_{i=1}^{n} P(A_i)\, p_{X|A_i}(x) = \sum_{i=1}^{n} P(A_i) \sum_x x\, p_{X|A_i}(x) = \sum_{i=1}^{n} P(A_i)\, E[X \mid A_i].$$

The remaining two equalities are verified similarly.


Example 2.16. Messages transmitted by a computer in Boston through a data 
network are destined for New York with probability 0.5, for Chicago with probability 
0.3, and for San Francisco with probability 0.2. The transit time X of a message 
is random. Its mean is 0.05 seconds if it is destined for New York, 0.1 seconds if it 
is destined for Chicago, and 0.3 seconds if it is destined for San Francisco. Then,
E[X] is easily calculated using the total expectation theorem as

$$E[X] = 0.5 \cdot 0.05 + 0.3 \cdot 0.1 + 0.2 \cdot 0.3 = 0.115 \text{ seconds}.$$
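The total expectation theorem amounts to a probability-weighted average of conditional means. A minimal sketch using the numbers of Example 2.16 (the dictionary layout is an illustrative choice):

```python
# destination -> (probability of destination, mean transit time in seconds)
destinations = {
    "New York": (0.5, 0.05),
    "Chicago": (0.3, 0.1),
    "San Francisco": (0.2, 0.3),
}

# E[X] = sum over i of P(A_i) * E[X | A_i]
expected_transit = sum(prob * mean for prob, mean in destinations.values())
print(round(expected_transit, 6))
```

Note that only the conditional means are needed; the full conditional PMFs of the transit time never enter the calculation.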


Example 2.17. Mean and Variance of the Geometric. You write a software
program over and over, and each time there is probability p that it works correctly,
independent of previous attempts. What is the mean and variance of X, the number
of tries until the program works correctly?

We recognize X as a geometric random variable with PMF 


$$p_X(k) = (1 - p)^{k-1} p, \qquad k = 1, 2, \ldots.$$

The mean and variance of X are given by

$$E[X] = \sum_{k=1}^{\infty} k (1 - p)^{k-1} p, \qquad \operatorname{var}(X) = \sum_{k=1}^{\infty} (k - E[X])^2 (1 - p)^{k-1} p,$$

but evaluating these infinite sums is somewhat tedious. As an alternative, we will
apply the total expectation theorem, with $A_1 = \{X = 1\} = \{\text{first try is a success}\}$,
$A_2 = \{X > 1\} = \{\text{first try is a failure}\}$, and end up with a much simpler calculation.

If the first try is successful, we have X = 1, and

$$E[X \mid X = 1] = 1.$$

If the first try fails (X > 1), we have wasted one try, and we are back where we
started. So, the expected number of remaining tries is E[X], and

$$E[X \mid X > 1] = 1 + E[X].$$

Thus,

$$E[X] = P(X = 1) E[X \mid X = 1] + P(X > 1) E[X \mid X > 1] = p + (1 - p)(1 + E[X]),$$

from which we obtain

$$E[X] = \frac{1}{p}.$$

With similar reasoning, we also have

$$E[X^2 \mid X = 1] = 1, \qquad E[X^2 \mid X > 1] = E[(1 + X)^2] = 1 + 2E[X] + E[X^2],$$

so that

$$E[X^2] = p \cdot 1 + (1 - p)\left(1 + 2E[X] + E[X^2]\right),$$

from which we obtain

$$E[X^2] = \frac{1 + 2(1 - p) E[X]}{p},$$

and, using the formula E[X] = 1/p derived above,

$$E[X^2] = \frac{2}{p^2} - \frac{1}{p}.$$

We conclude that

$$\operatorname{var}(X) = E[X^2] - (E[X])^2 = \frac{2}{p^2} - \frac{1}{p} - \frac{1}{p^2} = \frac{1 - p}{p^2}.$$
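The closed-form results of Example 2.17 can be checked against the "tedious" infinite sums by truncating them numerically. The sketch below is only a sanity check, not part of the derivation; the function name, the value p = 0.3, and the truncation point of 1000 terms are arbitrary illustrative choices.

```python
def geometric_moments(p, terms=1000):
    """Approximate E[X] and var(X) for a geometric(p) variable by truncated sums."""
    mean = sum(k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))
    second_moment = sum(k * k * (1 - p) ** (k - 1) * p for k in range(1, terms + 1))
    return mean, second_moment - mean ** 2

p = 0.3
approx_mean, approx_var = geometric_moments(p)
print(approx_mean, 1 / p)              # both close to 3.333...
print(approx_var, (1 - p) / p ** 2)    # both close to 7.777...
```

For p = 0.3 the neglected tail of each sum is far below floating-point precision, so the truncated sums agree with 1/p and (1 - p)/p^2 to many digits.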





Example 2.18. The Two-Envelopes Paradox. This is a much discussed puzzle 
that involves a subtle mathematical point regarding conditional expectations. 

You are handed two envelopes, and you are told that one of them contains
m times as much money as the other, where m is an integer with m > 1. You
open one of the envelopes and look at the amount inside. You may now keep this
amount, or you may switch envelopes and keep the amount in the other envelope.
What is the best strategy?
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Here is a line of reasoning that argues in favor of switching. Let A be the 
envelope you open and B be the envelope that you may switch to. Let also x and
y be the amounts in A and B, respectively. Then, as the argument goes, either
y = x/m or y = mx, with equal probability 1/2, so given x, the expected value of
y is

$$\frac{1}{2} \cdot \frac{x}{m} + \frac{1}{2} \cdot mx = \frac{1}{2}\left(\frac{1}{m} + m\right) x = \frac{1 + m^2}{2m}\, x > x,$$

since $1 + m^2 > 2m$ for m > 1. Therefore, you should always switch to envelope B!
But then, since you should switch regardless of the amount found in A, you might
as well open B to begin with; but once you do, you should switch again, etc.

There are two assumptions, both flawed to some extent, that underlie this 
paradoxical line of reasoning. 


(a) You have no a priori knowledge about the amounts in the envelopes, so given
x, the only thing you know about y is that it is either 1/m or m times x, and
there is no reason to assume that one is more likely than the other.


(b) Given two random variables X and Y, representing monetary amounts, if

$$E[Y \mid X = x] > x,$$

for all possible values x of X, then the strategy that always switches to Y
yields a higher expected monetary gain.


Let us scrutinize these assumptions. 

Assumption (a) is flawed because it relies on an incompletely specified probabilistic
model. Indeed, in any correct model, all events, including the possible
values of X and Y, must have well-defined probabilities. With such probabilistic
knowledge about X and Y, the value of X may reveal a great deal of information
about Y. For example, assume the following probabilistic model: someone chooses
an integer dollar amount Z from a known range $[\underline{z}, \overline{z}]$ according to some distribution,
places this amount in a randomly chosen envelope, and places m times this
amount in the other envelope. You then choose to open one of the two envelopes
(with equal probability), and look at the enclosed amount X. If X turns out to
be larger than the upper range limit $\overline{z}$, you know that X is the larger of the two
amounts, and hence you should not switch. On the other hand, for some other
values of X, such as the lower range limit $\underline{z}$, you should switch envelopes. Thus, in
this model, the choice to switch or not should depend on the value of X. Roughly
speaking, if you have an idea about the range and likelihood of the values of X, you
can judge whether the amount X found in A is relatively small or relatively large,
and accordingly switch or not switch envelopes.

Mathematically, in a correct probabilistic model, we must have a joint PMF
for the random variables X and Y, the amounts in envelopes A and B, respectively.
This joint PMF is specified by introducing a PMF $p_Z$ for the random variable Z,
the minimum of the amounts in the two envelopes. Then, for all z,

$$p_{X,Y}(mz, z) = p_{X,Y}(z, mz) = \frac{1}{2}\, p_Z(z),$$

and

$$p_{X,Y}(x, y) = 0,$$

for every (x, y) that is not of the form (mz, z) or (z, mz). With this specification
of $p_{X,Y}(x, y)$, and given that X = x, one can use the rule

switch if and only if $E[Y \mid X = x] > x$.

According to this decision rule, one may or may not switch envelopes, depending
on the value of X, as indicated earlier.

Is it true that, with the above described probabilistic model and decision 
rule, you should be switching for some values x but not for others? Ordinarily yes, 
as illustrated from the earlier example where Z takes values in a bounded range. 
However, here is a devilish example where, because of a subtle mathematical quirk,
you will always switch!

A fair coin is tossed until it comes up heads. Let N be the number of tosses.
Then, $m^N$ dollars are placed in one envelope and $m^{N-1}$ dollars are placed in the
other. Let X be the amount in the envelope you open (envelope A), and let Y be
the amount in the other envelope (envelope B).

Now, if A contains \$1, clearly B contains \$m, so you should switch envelopes.
If, on the other hand, A contains $m^n$ dollars, where n > 0, then B contains either
$m^{n-1}$ or $m^{n+1}$ dollars. Since N has a geometric PMF, we have

$$\frac{P(Y = m^{n+1} \mid X = m^n)}{P(Y = m^{n-1} \mid X = m^n)} = \frac{P(Y = m^{n+1}, X = m^n)}{P(Y = m^{n-1}, X = m^n)} = \frac{P(N = n + 1)}{P(N = n)} = \frac{1}{2}.$$

Thus

$$P(Y = m^{n-1} \mid X = m^n) = \frac{2}{3}, \qquad P(Y = m^{n+1} \mid X = m^n) = \frac{1}{3},$$

and

$$E[\text{amount in B} \mid X = m^n] = \frac{2}{3} \cdot m^{n-1} + \frac{1}{3} \cdot m^{n+1} = \frac{2 + m^2}{3m} \cdot m^n.$$

We have $(2 + m^2)/3m > 1$ if and only if $m^2 - 3m + 2 > 0$ or $(m - 1)(m - 2) > 0$.
Thus if m > 2, then

$$E[\text{amount in B} \mid X = m^n] > m^n,$$

and to maximize the expected monetary gain you should always switch to B!

What is happening in this example is that you switch for all values of x
because

$$E[Y \mid X = x] > x, \qquad \text{for all } x.$$

A naive application of the total expectation theorem might seem to indicate that
E[Y] > E[X]. However, this cannot be true, since X and Y have identical PMFs.
Instead, we have

$$E[Y] = E[X] = \infty,$$

which is not necessarily inconsistent with the relation $E[Y \mid X = x] > x$ for all x.

The conclusion is that the decision rule that switches if and only if
$E[Y \mid X = x] > x$ does not improve the expected monetary gain in the case where
$E[Y] = E[X] = \infty$, and the apparent paradox is resolved.
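The conditional expectation in the coin-toss version of the two-envelopes example can be computed exactly, as in the sketch below. It assumes the model described in the example (N geometric with parameter 1/2, amounts $m^N$ and $m^{N-1}$, envelope opened uniformly at random); the function name `expected_other_envelope` is hypothetical.

```python
from fractions import Fraction

def expected_other_envelope(m, n):
    """E[Y | X = m**n] in the coin-toss model, for n >= 1."""
    # X = m**n arises in two ways, each with an extra factor 1/2 for the
    # choice of envelope: N = n (we hold the larger amount, so Y = m**(n-1))
    # or N = n + 1 (we hold the smaller amount, so Y = m**(n+1)).
    w_hold_larger = Fraction(1, 2) ** n / 2
    w_hold_smaller = Fraction(1, 2) ** (n + 1) / 2
    total = w_hold_larger + w_hold_smaller
    return (w_hold_larger * m ** (n - 1) + w_hold_smaller * m ** (n + 1)) / total

for m in (2, 3, 4):
    n = 2
    print(m, expected_other_envelope(m, n), m ** n)
```

For m = 3 and n = 2 the result is 11, which exceeds the observed amount 9, so switching looks favorable. For m = 2 the conditional expectation equals the observed amount, consistent with the condition (m - 1)(m - 2) > 0 failing at m = 2.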


2.7 
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INDEPENDENCE 


We now discuss concepts of independence related to random variables. These 
are analogous to the concepts of independence between events (cf. Chapter 1). 
They are developed by simply introducing suitable events involving the possible 
values of various random variables, and by considering the independence of these 
events. 


Independence of a Random Variable from an Event 


The independence of a random variable from an event is similar to the independence
of two events. The idea is that knowing the occurrence of the conditioning
event provides no new information on the value of the random variable. More
formally, we say that the random variable X is independent of the event A
if

$$P(X = x \text{ and } A) = P(X = x) P(A) = p_X(x) P(A), \qquad \text{for all } x,$$

which is the same as requiring that the two events $\{X = x\}$ and A be independent,
for any choice x. From the definition of the conditional PMF, we have

$$P(X = x \text{ and } A) = p_{X|A}(x) P(A),$$

so that as long as P(A) > 0, independence is the same as the condition

$$p_{X|A}(x) = p_X(x), \qquad \text{for all } x.$$


Example 2.19. Consider two independent tosses of a fair coin. Let X be the 
number of heads and let A be the event that the number of heads is even. The 
(unconditional) PMF of X is 


$$p_X(x) = \begin{cases} 1/4, & \text{if } x = 0, \\ 1/2, & \text{if } x = 1, \\ 1/4, & \text{if } x = 2, \end{cases}$$

and P(A) = 1/2. The conditional PMF is obtained from the definition
$p_{X|A}(x) = P(X = x \text{ and } A)/P(A)$:

$$p_{X|A}(x) = \begin{cases} 1/2, & \text{if } x = 0, \\ 0, & \text{if } x = 1, \\ 1/2, & \text{if } x = 2. \end{cases}$$

Clearly, X and A are not independent, since the PMFs $p_X$ and $p_{X|A}$ are different.
For an example of a random variable that is independent of A, consider the random
variable that takes the value 0 if the first toss is a head, and the value 1 if the first
toss is a tail. This is intuitively clear and can also be verified by using the definition
of independence.
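Both claims of Example 2.19 can be verified directly from the definition by enumerating the four equally likely outcomes. This is a sketch; the helper names (`is_independent_of_event`, `num_heads`, and so on) are illustrative.

```python
from fractions import Fraction
from itertools import product

# Sample space: two independent tosses of a fair coin.
outcomes = {toss: Fraction(1, 4) for toss in product("HT", repeat=2)}

def is_independent_of_event(rv, event):
    """Check P(X = x and A) == P(X = x) P(A) for every value x of the variable."""
    p_a = sum(p for w, p in outcomes.items() if event(w))
    for x in {rv(w) for w in outcomes}:
        p_x = sum(p for w, p in outcomes.items() if rv(w) == x)
        p_x_and_a = sum(p for w, p in outcomes.items() if rv(w) == x and event(w))
        if p_x_and_a != p_x * p_a:
            return False
    return True

def num_heads(w):
    return w.count("H")

def even_number_of_heads(w):
    return num_heads(w) % 2 == 0

def first_toss_indicator(w):
    return 0 if w[0] == "H" else 1

print(is_independent_of_event(num_heads, even_number_of_heads))
print(is_independent_of_event(first_toss_indicator, even_number_of_heads))
```

The first check fails (the number of heads is not independent of A), while the second succeeds (the first-toss indicator is independent of A).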
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Independence of Random Variables 


The notion of independence of two random variables is similar to the independence
of a random variable from an event. We say that two random variables
X and Y are independent if

$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \qquad \text{for all } x, y.$$

This is the same as requiring that the two events $\{X = x\}$ and $\{Y = y\}$ be independent
for every x and y. Finally, the formula $p_{X,Y}(x, y) = p_{X|Y}(x \mid y)\, p_Y(y)$
shows that independence is equivalent to the condition

$$p_{X|Y}(x \mid y) = p_X(x), \qquad \text{for all } y \text{ with } p_Y(y) > 0 \text{ and all } x.$$

Intuitively, independence means that the value of Y provides no information on
the value of X.

There is a similar notion of conditional independence of two random variables,
given an event A with P(A) > 0. The conditioning event A defines a new
universe and all probabilities (or PMFs) have to be replaced by their conditional
counterparts. For example, X and Y are said to be conditionally independent,
given a positive probability event A, if

$$P(X = x, Y = y \mid A) = P(X = x \mid A)\, P(Y = y \mid A), \qquad \text{for all } x \text{ and } y,$$

or, in this chapter's notation,

$$p_{X,Y|A}(x, y) = p_{X|A}(x)\, p_{Y|A}(y), \qquad \text{for all } x \text{ and } y.$$

Once more, this is equivalent to

$$p_{X|Y,A}(x \mid y) = p_{X|A}(x) \qquad \text{for all } x \text{ and } y \text{ such that } p_{Y|A}(y) > 0.$$

As in the case of events (Section 1.5), conditional independence may not imply
unconditional independence and vice versa. This is illustrated by the example
in Fig. 2.15.

If X and Y are independent random variables, then

$$E[XY] = E[X]\, E[Y],$$

as shown by the following calculation:

$$E[XY] = \sum_x \sum_y x y\, p_{X,Y}(x, y) = \sum_x \sum_y x y\, p_X(x)\, p_Y(y) \quad \text{(by independence)}$$

$$= \sum_x x\, p_X(x) \sum_y y\, p_Y(y) = E[X]\, E[Y].$$
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Figure 2.15: Example illustrating that conditional independence may not imply 
unconditional independence. For the PMF shown, the random variables X and 
Y are not independent. For example, we have

$$p_{X|Y}(1 \mid 1) = P(X = 1 \mid Y = 1) = 0 \neq P(X = 1) = p_X(1).$$

On the other hand, conditional on the event $A = \{X \le 2, Y \ge 3\}$ (the shaded
set in the figure), the random variables X and Y can be seen to be independent.
In particular, we have

$$p_{X|Y,A}(x \mid y) = \begin{cases} 1/3, & \text{if } x = 1, \\ 2/3, & \text{if } x = 2, \end{cases}$$

for both values y = 3 and y = 4.


A very similar calculation also shows that if X and Y are independent, then

$$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)],$$

for any functions g and h. In fact, this follows immediately once we realize that
if X and Y are independent, then the same is true for g(X) and h(Y). This is
intuitively clear and its formal verification is left as an end-of-chapter problem.

Consider now the sum X + Y of two independent random variables X and
Y, and let us calculate its variance. Since the variance of a random variable is
unchanged when the random variable is shifted by a constant, it is convenient to
work with the zero-mean random variables $\tilde{X} = X - E[X]$ and $\tilde{Y} = Y - E[Y]$.
We have

$$\operatorname{var}(X + Y) = \operatorname{var}(\tilde{X} + \tilde{Y}) = E[(\tilde{X} + \tilde{Y})^2] = E[\tilde{X}^2 + 2\tilde{X}\tilde{Y} + \tilde{Y}^2]$$

$$= E[\tilde{X}^2] + 2E[\tilde{X}\tilde{Y}] + E[\tilde{Y}^2] = E[\tilde{X}^2] + E[\tilde{Y}^2] = \operatorname{var}(\tilde{X}) + \operatorname{var}(\tilde{Y}) = \operatorname{var}(X) + \operatorname{var}(Y).$$

We have used above the property $E[\tilde{X}\tilde{Y}] = 0$, which is justified as follows. The
random variables $\tilde{X} = X - E[X]$ and $\tilde{Y} = Y - E[Y]$ are independent (because
they are functions of the independent random variables X and Y), and since
they also have zero mean, we obtain

$$E[\tilde{X}\tilde{Y}] = E[\tilde{X}]\, E[\tilde{Y}] = 0.$$

In conclusion, the variance of the sum of two independent random variables
is equal to the sum of their variances. For an interesting comparison, note that
the mean of the sum of two random variables is always equal to the sum of their
means, even if they are not independent.
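The two identities just derived, E[XY] = E[X] E[Y] and var(X + Y) = var(X) + var(Y) for independent X and Y, can be checked exactly on small PMFs. In the sketch below the two marginal PMFs are made up for illustration, and independence is imposed by taking the joint PMF to be the product of the marginals.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    mu = mean(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

def pmf_of_sum(pmf_x, pmf_y):
    """PMF of X + Y when X and Y are independent (joint PMF = product of marginals)."""
    result = {}
    for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()):
        result[x + y] = result.get(x + y, 0) + px * py
    return result

# Two made-up PMFs, used only for illustration.
pmf_x = {1: Fraction(1, 2), 2: Fraction(1, 4), 4: Fraction(1, 4)}
pmf_y = {0: Fraction(1, 3), 3: Fraction(2, 3)}

e_xy = sum(x * y * px * py
           for (x, px), (y, py) in product(pmf_x.items(), pmf_y.items()))
pmf_s = pmf_of_sum(pmf_x, pmf_y)

print(e_xy == mean(pmf_x) * mean(pmf_y))
print(variance(pmf_s) == variance(pmf_x) + variance(pmf_y))
```

Because exact fractions are used, both comparisons hold with equality rather than only approximately.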






Summary of Facts About Independent Random Variables

Let A be an event, with P(A) > 0, and let X and Y be random variables
associated with the same experiment.

• X is independent of the event A if

$$p_{X|A}(x) = p_X(x), \qquad \text{for all } x,$$

that is, if for all x, the events $\{X = x\}$ and A are independent.

• X and Y are independent if for all pairs (x, y), the events $\{X = x\}$
and $\{Y = y\}$ are independent, or equivalently

$$p_{X,Y}(x, y) = p_X(x)\, p_Y(y), \qquad \text{for all } x, y.$$

• If X and Y are independent random variables, then

$$E[XY] = E[X]\, E[Y].$$

Furthermore, for any functions g and h, the random variables g(X)
and h(Y) are independent, and we have

$$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)].$$

• If X and Y are independent, then

$$\operatorname{var}(X + Y) = \operatorname{var}(X) + \operatorname{var}(Y).$$





Independence of Several Random Variables 


The preceding discussion extends naturally to the case of more than two random
variables. For example, three random variables X, Y, and Z are said to be
independent if

$$p_{X,Y,Z}(x, y, z) = p_X(x)\, p_Y(y)\, p_Z(z), \qquad \text{for all } x, y, z.$$

If X, Y, and Z are independent random variables, then any three random
variables of the form f(X), g(Y), and h(Z) are also independent. Similarly, any
two random variables of the form g(X, Y) and h(Z) are independent. On the
other hand, two random variables of the form g(X, Y) and h(Y, Z) are usually
not independent because they are both affected by Y. Properties such as the
above are intuitively clear if we interpret independence in terms of noninteracting
(sub)experiments. They can be formally verified but this is sometimes
tedious. Fortunately, there is general agreement between intuition and what is
mathematically correct. This is basically a testament that our definitions of
independence adequately reflect the intended interpretation.


Variance of the Sum of Independent Random Variables 


Sums of independent random variables are especially important in a variety of
contexts. For example, they arise in statistical applications where we "average"
a number of independent measurements, with the aim of minimizing the effects
of measurement errors. They also arise when dealing with the cumulative effect
of several independent sources of randomness. We provide some illustrations in
the examples that follow and we will also return to this theme in later chapters.

In the examples below, we will make use of the following key property. If
$X_1, X_2, \ldots, X_n$ are independent random variables, then

$$\operatorname{var}(X_1 + X_2 + \cdots + X_n) = \operatorname{var}(X_1) + \operatorname{var}(X_2) + \cdots + \operatorname{var}(X_n).$$

This can be verified by repeated use of the formula var(X + Y) = var(X) + var(Y)
for two independent random variables X and Y.


Example 2.20. Variance of the Binomial and the Poisson. We consider n
independent coin tosses, with each toss having probability p of coming up a head.
For each i, we let $X_i$ be the Bernoulli random variable which is equal to 1 if the
ith toss comes up a head, and is 0 otherwise. Then, $X = X_1 + X_2 + \cdots + X_n$
is a binomial random variable. Its mean is E[X] = np, as derived in Example
2.10. By the independence of the coin tosses, the random variables $X_1, \ldots, X_n$ are
independent, and

$$\operatorname{var}(X) = \sum_{i=1}^{n} \operatorname{var}(X_i) = np(1 - p).$$

As we discussed in Section 2.2, a Poisson random variable Y with parameter
$\lambda$ can be viewed as the "limit" of the binomial as $n \to \infty$, $p \to 0$, while $np = \lambda$.
Thus, taking the limit of the mean and the variance of the binomial, we informally
obtain the mean and variance of the Poisson: $E[Y] = \operatorname{var}(Y) = \lambda$. We have indeed
verified the formula $E[Y] = \lambda$ in Example 2.7. To verify the formula $\operatorname{var}(Y) = \lambda$,
we write

$$E[Y^2] = \sum_{k=1}^{\infty} k^2 e^{-\lambda} \frac{\lambda^k}{k!} = \lambda \sum_{k=1}^{\infty} k\, \frac{e^{-\lambda} \lambda^{k-1}}{(k-1)!} = \lambda \sum_{m=0}^{\infty} (m + 1)\, \frac{e^{-\lambda} \lambda^m}{m!} = \lambda (E[Y] + 1) = \lambda(\lambda + 1),$$

from which

$$\operatorname{var}(Y) = E[Y^2] - (E[Y])^2 = \lambda(\lambda + 1) - \lambda^2 = \lambda.$$
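The formulas of Example 2.20 can be checked numerically from the PMFs themselves. In the following sketch the parameter values (n = 20, p = 0.3, and a Poisson parameter of 4) are arbitrary, and the Poisson PMF is truncated at k = 100, where the neglected tail is negligible for this parameter.

```python
from math import comb, exp, factorial

def pmf_mean_var(pmf):
    """Mean and variance computed directly from a PMF given as {value: probability}."""
    mean = sum(k * p for k, p in pmf.items())
    var = sum((k - mean) ** 2 * p for k, p in pmf.items())
    return mean, var

n, p = 20, 0.3
binomial = {k: comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)}

lam = 4.0
# Truncated Poisson PMF (assumption: terms beyond k = 100 are negligible here).
poisson = {k: exp(-lam) * lam ** k / factorial(k) for k in range(101)}

binomial_mean, binomial_var = pmf_mean_var(binomial)
poisson_mean, poisson_var = pmf_mean_var(poisson)
print(binomial_mean, binomial_var)   # close to np = 6 and np(1 - p) = 4.2
print(poisson_mean, poisson_var)     # both close to lam = 4
```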


The formulas for the mean and variance of a weighted sum of random 
variables form the basis for many statistical procedures that estimate the mean 
of a random variable by averaging many independent samples. A typical case is 
illustrated in the following example. 


Example 2.21. Mean and Variance of the Sample Mean. We wish to 
estimate the approval rating of a president, to be called B. To this end, we ask n 
persons drawn at random from the voter population, and we let $X_i$ be a random
variable that encodes the response of the ith person:

$$X_i = \begin{cases} 1, & \text{if the } i\text{th person approves B's performance,} \\ 0, & \text{if the } i\text{th person disapproves B's performance.} \end{cases}$$

We model $X_1, X_2, \ldots, X_n$ as independent Bernoulli random variables with common
mean p and variance p(1 - p). Naturally, we view p as the true approval rating of
B. We "average" the responses and compute the sample mean $S_n$, defined as

$$S_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$

Thus, the random variable $S_n$ is the approval rating of B within our n-person
sample.

We have, using the linearity of $S_n$ as a function of the $X_i$,

$$E[S_n] = \sum_{i=1}^{n} \frac{1}{n} E[X_i] = \frac{1}{n} \sum_{i=1}^{n} p = p,$$

and making use of the independence of $X_1, \ldots, X_n$,

$$\operatorname{var}(S_n) = \sum_{i=1}^{n} \frac{1}{n^2} \operatorname{var}(X_i) = \frac{p(1 - p)}{n}.$$

The sample mean $S_n$ can be viewed as a "good" estimate of the approval rating.
This is because it has the correct expected value, which is the approval rating p, and
its accuracy, as reflected by its variance, improves as the sample size n increases.

Note that even if the random variables $X_i$ are not Bernoulli, the same calculation
yields

$$\operatorname{var}(S_n) = \frac{\operatorname{var}(X)}{n},$$

as long as the $X_i$ are independent, with common mean E[X] and variance var(X).
Thus, again, the sample mean becomes a good estimate (in terms of variance) of the
true mean E[X], as the sample size n increases. We will revisit the properties of the
sample mean and discuss them in much greater detail in Chapter 5, in connection
with the laws of large numbers.


Example 2.22. Estimating Probabilities by Simulation. In many practical 
situations, the analytical calculation of the probability of some event of interest is 
very difficult. However, if we have a physical or computer model that can generate 
outcomes of a given experiment in accordance with their true probabilities, we 
can use simulation to calculate with high accuracy the probability of any given 
event A. In particular, we independently generate with our model n outcomes, 
we record the number m of outcomes that belong to the event A of interest, and 
we approximate P(A) by m/n. For example, to calculate the probability p = 
P(Heads) of a coin, we toss the coin n times, and we approximate p with the ratio 
(number of heads recorded) /n. 

To see how accurate this process is, consider n independent Bernoulli random
variables $X_1, \ldots, X_n$, each with PMF

$$p_{X_i}(k) = \begin{cases} P(A), & \text{if } k = 1, \\ 1 - P(A), & \text{if } k = 0. \end{cases}$$

In a simulation context, $X_i$ corresponds to the ith outcome, and takes the value 1
if the ith outcome belongs to the event A. The value of the random variable

$$X = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

is the estimate of P(A) provided by the simulation. According to Example 2.21, X
has mean P(A) and variance P(A)(1 - P(A))/n, so that for large n, it provides an
accurate estimate of P(A).
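The simulation procedure of Example 2.22 can be sketched as follows. The event used here (the sum of two fair dice equals 7, with true probability 1/6) is an illustrative choice that does not appear in the example, as are the function names and the fixed seed.

```python
import random

def estimate_probability(event_occurs, n, seed=0):
    """Estimate P(A) by m/n, where m counts occurrences of A in n independent runs."""
    rng = random.Random(seed)   # fixed seed so that the run is reproducible
    m = sum(1 for _ in range(n) if event_occurs(rng))
    return m / n

def sum_of_two_dice_is_seven(rng):
    """One run of the experiment: roll two fair dice and report whether A occurred."""
    return rng.randint(1, 6) + rng.randint(1, 6) == 7

n = 100_000
estimate = estimate_probability(sum_of_two_dice_is_seven, n)
true_p = 1 / 6
std_dev = (true_p * (1 - true_p) / n) ** 0.5   # square root of P(A)(1 - P(A))/n
print(estimate, true_p, std_dev)
```

With n = 100,000 the standard deviation of the estimate is roughly 0.0012, so the estimate should typically agree with 1/6 to about two decimal places.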


2.8 SUMMARY AND DISCUSSION 


Random variables provide the natural tools for dealing with probabilistic models 
in which the outcome determines certain numerical values of interest. In this 
chapter, we focused on discrete random variables, and developed a conceptual 
framework and some relevant tools. 

In particular, we introduced concepts such as the PMF, the mean, and the
variance, which describe in various degrees of detail the probabilistic character
of a discrete random variable. We showed how to use the PMF of a random
variable X to calculate the mean and the variance of a related random variable
Y = g(X) without calculating the PMF of Y. In the special case where g is
a linear function, Y = aX + b, the means and the variances of X and Y are
related by

$$E[Y] = aE[X] + b, \qquad \operatorname{var}(Y) = a^2 \operatorname{var}(X).$$

We also discussed several special random variables, and derived their PMF,
mean, and variance, as summarized in the table that follows.





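Summary of Results for Special Random Variables

Discrete Uniform over [a, b]:

$$p_X(k) = \begin{cases} \dfrac{1}{b - a + 1}, & \text{if } k = a, a + 1, \ldots, b, \\ 0, & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{a + b}{2}, \qquad \operatorname{var}(X) = \frac{(b - a)(b - a + 2)}{12}.$$

Bernoulli with Parameter p: (Describes the success or failure in a single
trial.)

$$p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0, \end{cases}$$

$$E[X] = p, \qquad \operatorname{var}(X) = p(1 - p).$$

Binomial with Parameters p and n: (Describes the number of successes
in n independent Bernoulli trials.)

$$p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

$$E[X] = np, \qquad \operatorname{var}(X) = np(1 - p).$$

Geometric with Parameter p: (Describes the number of trials until the
first success, in a sequence of independent Bernoulli trials.)

$$p_X(k) = (1 - p)^{k-1} p, \qquad k = 1, 2, \ldots,$$

$$E[X] = \frac{1}{p}, \qquad \operatorname{var}(X) = \frac{1 - p}{p^2}.$$

Poisson with Parameter $\lambda$: (Approximates the binomial PMF when n
is large, p is small, and $\lambda = np$.)

$$p_X(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, \ldots,$$

$$E[X] = \lambda, \qquad \operatorname{var}(X) = \lambda.$$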





We also considered multiple random variables, and introduced joint PMFs, 
conditional PMFs, and associated expected values. Conditional PMFs are often 
the starting point in probabilistic models and can be used to calculate other 
quantities of interest, such as marginal or joint PMFs and expectations, through 
a sequential or a divide-and-conquer approach. In particular, given the conditional
PMF $p_{X|Y}(x \mid y)$:


(a) The joint PMF can be calculated by

$$p_{X,Y}(x, y) = p_Y(y)\, p_{X|Y}(x \mid y).$$

This can be extended to the case of three or more random variables, as in

$$p_{X,Y,Z}(x, y, z) = p_Z(z)\, p_{Y|Z}(y \mid z)\, p_{X|Y,Z}(x \mid y, z),$$

and is analogous to the sequential tree-based calculation method using the
multiplication rule, discussed in Chapter 1.


(b) The marginal PMF can be calculated by

$$p_X(x) = \sum_y p_Y(y)\, p_{X|Y}(x \mid y),$$

which generalizes the divide-and-conquer calculation method we discussed
in Chapter 1.


(c) The divide-and-conquer calculation method in (b) above can be extended
to compute expected values using the total expectation theorem:

$$E[X] = \sum_y p_Y(y)\, E[X \mid Y = y].$$


We introduced the notion of independence of random variables, in analogy
with the notion of independence of events. Among other topics, we focused on
random variables X obtained by adding several independent random variables
$X_1, \ldots, X_n$:

$$X = X_1 + \cdots + X_n.$$
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We argued that the mean and the variance of the sum are equal to the sum of 
the means and the sum of the variances, respectively: 


E[X] = E[Xi] t --- - E[X;]. var( X) = var(X1) +--+  var( Xs). 


The formula for the mean does not require independence of the Xj, but the 
formula for the variance does. 
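
As a numerical illustration, the PMF of a sum of independent random variables can be built by convolution, and its mean and variance compared with the sums of the individual means and variances. The sketch below is not from the text; the helper names `convolve`, `mean`, and `variance`, and the two sample PMFs, are assumptions chosen for the example.

```python
def convolve(p, q):
    """PMF of the sum of two independent random variables with PMFs p and q."""
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, 0.0) + pa * qb
    return out

def mean(p):
    return sum(k * pk for k, pk in p.items())

def variance(p):
    m = mean(p)
    return sum((k - m) ** 2 * pk for k, pk in p.items())

# Illustrative PMFs: a fair six-sided die and a biased 0/1 coin.
die = {k: 1 / 6 for k in range(1, 7)}
coin = {0: 0.3, 1: 0.7}

# X = (die roll) + (die roll) + (coin), all independent.
total = convolve(convolve(die, die), coin)
print(mean(total), 2 * mean(die) + mean(coin))
print(variance(total), 2 * variance(die) + variance(coin))
```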

The concepts and methods of this chapter extend appropriately to general 
random variables (see the next chapter), and are fundamental for our subject.


PROBLEMS





SECTION 2.2. Probability Mass Functions 


Problem 1. The MIT soccer team has 2 games scheduled for one weekend. It has 
a 0.4 probability of not losing the first game, and a 0.7 probability of not losing the
second game, independent of the first. If it does not lose a particular game, the team
is equally likely to win or tie, independent of what happens in the other game. The
MIT team will receive 2 points for a win, 1 for a tie, and 0 for a loss. Find the PMF
of the number of points that the team earns over the weekend. 


Problem 2. You go to a party with 500 guests. What is the probability that 
exactly one other guest has the same birthday as you? Calculate this exactly and 
also approximately by using the Poisson PMF. (For simplicity, exclude birthdays on
February 29.) 


Problem 3. Fischer and Spassky play a chess match in which the first player to win 
a game wins the match. After 10 successive draws, the match is declared drawn. Each
game is won by Fischer with probability 0.4, is won by Spassky with probability 0.3,
and is a draw with probability 0.3, independent of previous games.


(a) What is the probability that Fischer wins the match? 
(b) What is the PMF of the duration of the match? 


Problem 4. An internet service provider uses 50 modems to serve the needs of 1000 
customers. It is estimated that at a given time, each customer will need a connection
with probability 0.01, independent of the other customers. 


(a) What is the PMF of the number of modems in use at the given time? 


(b) Repeat part (a) by approximating the PMF of the number of customers that 
need a connection with a Poisson PMF. 


(c) What is the probability that there are more customers needing a connection than 
there are modems? Provide an exact, as well as an approximate, formula based
on the Poisson approximation of part (b). 


Problem 5. A packet communication system consists of a buffer that stores packets 
from some source, and a communication line that retrieves packets from the buffer and 
transmits them to a receiver. The system operates in time-slot pairs. In the first slot, 
the system stores a number of packets that are generated by the source according to 
a Poisson PMF with parameter λ; however, the maximum number of packets that can
be stored is a given integer b, and packets arriving to a full buffer are discarded. In the
second slot, the system transmits either all the stored packets or c packets (whichever
is less). Here, c is a given integer with 0 < c < b.


(a) Assuming that at the beginning of the first slot the buffer is empty, find the PMF
of the number of packets stored at the end of the first slot and at the end of the 
second slot. 


(b) What is the probability that some packets get discarded during the first slot? 


Problem 6. The Celtics and the Lakers are set to play a playoff series of n basketball 
games, where n is odd. The Celtics have a probability p of winning any one game, 
independent of other games. 


(a) Find the values of p for which n = 5 is better for the Celtics than n = 3.


(b) Generalize part (a), i.e., for any k > 0, find the values for p for which n = 2k + 1
is better for the Celtics than n = 2k - 1.


Problem 7. You just rented a large house and the realtor gave you 5 keys, one for 
each of the 5 doors of the house. Unfortunately, all keys look identical, so to open the
front door, you try them at random. 


(a) Find the PMF of the number of trials you will need to open the door, under the 
following alternative assumptions: (1) after an unsuccessful trial, you mark the
corresponding key, so that you never try it again, and (2) at each trial you are
equally likely to choose any key. 


(b) Repeat part (a) for the case where the realtor gave you an extra duplicate key 
for each of the 5 doors. 


Problem 8. Recursive computation of the binomial PMF. Let X be a binomial 
random variable with parameters n and p. Show that its PMF can be computed by 
starting with $p_X(0) = (1-p)^n$, and then using the recursive formula

$$p_X(k+1) = \frac{p}{1-p} \cdot \frac{n-k}{k+1} \cdot p_X(k), \qquad k = 0, 1, \ldots, n-1.$$
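
The recursion translates directly into a short loop. The Python sketch below is illustrative (the function names are hypothetical) and assumes 0 < p < 1, so that the ratio p/(1 - p) is well defined. It also compares the recursive values against the direct binomial formula.

```python
from math import comb

def binomial_pmf_recursive(n, p):
    """Return [p_X(0), ..., p_X(n)] using the recursion; assumes 0 < p < 1."""
    pmf = [(1 - p) ** n]              # starting value p_X(0) = (1 - p)^n
    ratio = p / (1 - p)
    for k in range(n):                # k = 0, 1, ..., n - 1
        pmf.append(ratio * (n - k) / (k + 1) * pmf[k])
    return pmf

def binomial_pmf_direct(n, p):
    """Direct evaluation of the binomial PMF, used only as a cross-check."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

recursive = binomial_pmf_recursive(10, 0.3)
direct = binomial_pmf_direct(10, 0.3)
print(max(abs(r - d) for r, d in zip(recursive, direct)))
```

The recursion avoids computing large factorials, which is the practical reason to prefer it when n is large.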


Problem 9. Form of the binomial PMF. Consider a binomial random variable 
X with parameters n and p. Let k* be the largest integer that is less than or equal 
to (n + 1)p. Show that the PMF $p_X(k)$ is monotonically nondecreasing with k in the
range from 0 to k*, and is monotonically decreasing with k for k > k*.


Problem 10. Form of the Poisson PMF. Let X be a Poisson random variable 
with parameter λ. Show that the PMF $p_X(k)$ increases monotonically with k up to
the point where k reaches the largest integer not exceeding λ, and after that point
decreases monotonically with k. 


Problem 11.* The matchbox problem - inspired by Banach's smoking 
habits. A smoker mathematician carries one matchbox in his right pocket and one in 
his left pocket. Each time he wants to light a cigarette, he selects a matchbox from 
either pocket with probability p = 1/2, independent of earlier selections. The two
matchboxes have initially n matches each. What is the PMF of the number of
remaining matches at the moment when the mathematician reaches for a match and discovers
that the corresponding matchbox is empty? How can we generalize to the case where
the probabilities of a left and a right pocket selection are p and 1 - p, respectively?


Solution. Let X be the number of matches that remain when a matchbox is found
empty. For k = 0, 1, ..., n, let $L_k$ (or $R_k$) be the event that an empty box is first
discovered in the left (respectively, right) pocket while the number of matches in the
right (respectively, left) pocket is k at that time. The PMF of X is

$$p_X(k) = P(L_k) + P(R_k), \qquad k = 0, 1, \ldots, n.$$


Viewing a left and a right pocket selection as a "success" and a "failure," respectively,
$P(L_k)$ is the probability that there are n successes in the first 2n - k trials, and trial
2n - k + 1 is a success, or

$$P(L_k) = \frac{1}{2} \binom{2n-k}{n} \left(\frac{1}{2}\right)^{2n-k}, \qquad k = 0, 1, \ldots, n.$$

By symmetry, $P(L_k) = P(R_k)$, so

$$p_X(k) = P(L_k) + P(R_k) = \binom{2n-k}{n} \left(\frac{1}{2}\right)^{2n-k}, \qquad k = 0, 1, \ldots, n.$$


In the more general case, where the probabilities of a left and a right pocket 
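selection are p and 1 - p, using a similar reasoning, we obtain

$$P(L_k) = p \binom{2n-k}{n} p^n (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

and

$$P(R_k) = (1-p) \binom{2n-k}{n} p^{n-k} (1-p)^n, \qquad k = 0, 1, \ldots, n,$$

which yields

$$p_X(k) = P(L_k) + P(R_k) = \binom{2n-k}{n} \left( p^{n+1} (1-p)^{n-k} + p^{n-k} (1-p)^{n+1} \right), \qquad k = 0, 1, \ldots, n.$$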
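
A quick numerical sanity check of the matchbox PMF is that it sums to 1 over k = 0, 1, ..., n. The Python sketch below evaluates the general formula; the function name `matchbox_pmf` and the sample values of n and p are assumptions made for illustration.

```python
from math import comb

def matchbox_pmf(n, p=0.5):
    """PMF of the number of remaining matches, k = 0, ..., n.

    p is the probability of selecting the left pocket.
    """
    return [
        comb(2 * n - k, n)
        * (p ** (n + 1) * (1 - p) ** (n - k) + p ** (n - k) * (1 - p) ** (n + 1))
        for k in range(n + 1)
    ]

for p in (0.5, 0.3):
    pmf = matchbox_pmf(5, p)
    print(p, sum(pmf))
```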


Problem 12.* Justification of the Poisson approximation property.
Consider the PMF of a binomial random variable with parameters n and p. Show that
asymptotically, as

$$n \to \infty, \qquad p \to 0,$$

while np is fixed at a given value λ, this PMF approaches the PMF of a Poisson random
variable with parameter λ.

Solution. Using the equation λ = np, write the binomial PMF as

$$p_X(k) = \frac{n!}{(n-k)!\, k!}\, p^k (1-p)^{n-k} = \frac{n(n-1) \cdots (n-k+1)}{n^k} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}.$$

Fix k and let $n \to \infty$. We have, for j = 1, ..., k,

$$\frac{n-k+j}{n} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{-k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}.$$

Thus, for each fixed k, as $n \to \infty$ we obtain

$$p_X(k) \to e^{-\lambda} \frac{\lambda^k}{k!}.$$

SECTION 2.3. Functions of Random Variables 


Problem 13. A family has 5 natural children and has adopted 2 girls. Each natural 
child has equal probability of being a girl or a boy, independent of the other children. 
Find the PMF of the number of girls out of the 7 children. 


Problem 14. Let X be a random variable that takes values from 0 to 9 with equal 
probability 1/10. 

(a) Find the PMF of the random variable Y = X mod(3). 

(b) Find the PMF of the random variable Y = 5 mod(X + 1).


Problem 15. Let K be a random variable that takes, with equal probability 1/(2n+1),
the integer values in the interval [-n, n]. Find the PMF of the random variable Y =
ln X, where $X = a^{|K|}$, and a is a positive number.


SECTION 2.4. Expectation, Mean, and Variance 


Problem 16. Let X be a random variable with PMF 


$$p_X(x) = \begin{cases} x^2/a, & \text{if } x = -3, -2, -1, 0, 1, 2, 3, \\ 0, & \text{otherwise.} \end{cases}$$


(a) Find a and E[X].

(b) What is the PMF of the random variable $Z = (X - E[X])^2$?

(c) Using the result from part (b), find the variance of X.

(d) Find the variance of X using the formula $\mathrm{var}(X) = \sum_x (x - E[X])^2\, p_X(x)$.


Problem 17. A city's temperature is modeled as a random variable with mean and 
standard deviation both equal to 10 degrees Celsius. A day is described as "normal" if
the temperature during that day ranges within one standard deviation from the mean. 
What would be the temperature range for a normal day if temperature were expressed 
in degrees Fahrenheit? 


Problem 18. Let a and b be positive integers with a ≤ b, and let X be a random
variable that takes as values, with equal probability, the powers of 2 in the interval
$[2^a, 2^b]$. Find the expected value and the variance of X.


Problem 19. A prize is randomly placed in one of ten boxes, numbered from 1 to 10.
You search for the prize by asking yes-no questions. Find the expected number of 
questions until you are sure about the location of the prize, under each of the following 
strategies. 


(a) An enumeration strategy: you ask questions of the form "is it in box k?". 


(b) A bisection strategy: you eliminate as close to half of the remaining boxes as 
possible by asking questions of the form “is it in a box numbered less than or 
equal to k?”. 


Solution. We will find the expected gain for each strategy, by computing the expected 
number of questions until we find the prize. 


(a) With this strategy, the probability of finding the location of the prize with i
questions, where i = 1, ..., 10, is 1/10. Therefore, the expected number of questions is

$$\frac{1}{10} \sum_{i=1}^{10} i = \frac{1}{10} \cdot 55 = 5.5.$$


(b) It can be checked that for 4 of the 10 possible box numbers, exactly 4 questions
will be needed, whereas for 6 of the 10 numbers, 3 questions will be needed. Therefore,
with this strategy, the expected number of questions is

$$\frac{4}{10} \cdot 4 + \frac{6}{10} \cdot 3 = 3.4.$$
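
The question counts for the bisection strategy in part (b) can be checked by enumerating all ten prize locations. The Python sketch below is illustrative; the function name `bisection_questions` and the particular tie-breaking rule (the lower half receives the extra box when the split is uneven) are assumptions.

```python
def bisection_questions(lo, hi, prize):
    """Count 'is it in a box numbered <= k?' questions needed to locate prize in lo..hi."""
    count = 0
    while lo < hi:
        mid = (lo + hi) // 2      # split the remaining boxes as evenly as possible
        count += 1
        if prize <= mid:
            hi = mid
        else:
            lo = mid + 1
    return count

counts = [bisection_questions(1, 10, box) for box in range(1, 11)]
expected = sum(counts) / 10       # each box is equally likely to hold the prize
print(counts, expected)
```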


Problem 20. As an advertising campaign, a chocolate factory places golden tickets 
in some of its candy bars, with the promise that a golden ticket is worth a trip through 
the chocolate factory, and all the chocolate you can eat for life. If the probability of 
finding a golden ticket is p, find the mean and the variance of the number of candy 
bars you need to eat to find a ticket. 


Problem 21. St. Petersburg paradox. You toss independently a fair coin and you 
count the number of tosses until the first tail appears. If this number is n, you receive
$2^n$ dollars. What is the expected amount that you will receive? How much would you
be willing to pay to play this game? 


Problem 22. Two coins are simultaneously tossed until one of them comes up a head 
and the other a tail. The first coin comes up a head with probability p and the second 
with probability q. All tosses are assumed independent. 


(a) Find the PMF, the expected value, and the variance of the number of tosses. 


(b) What is the probability that the last toss of the first coin is a head? 


Problem 23. 


(a) A fair coin is tossed repeatedly and independently until two consecutive heads 
or two consecutive tails appear. Find the PMF, the expected value, and the 
variance of the number of tosses.


(b) Assume now that the coin is tossed until we obtain a tail that is immediately
preceded by a head. Find the PMF and the expected value of the number of 
tosses. 


SECTION 2.5. Joint PMFs of Multiple Random Variables 


Problem 24. A stock market trader buys 100 shares of stock A and 200 shares of 
stock B. Let X and Y be the price changes of A and B, respectively, over a certain time
period, and assume that the joint PMF of X and Y is uniform over the set of integers
x and y satisfying

$$-2 \le x \le 4, \qquad -1 \le y - x \le 1.$$


(a) Find the marginal PMFs and the means of X and Y. 
(b) Find the mean of the trader's profit. 


Problem 25. A class of n students takes a test consisting of m questions. Suppose 
that student i submitted answers to the first $m_i$ questions.


(a) The grader randomly picks one answer, call it (I, J), where I is the student ID
number (taking values 1, ..., n) and J is the question number (taking values
1, ..., m). Assume that all answers are equally likely to be picked. Calculate the
joint and the marginal PMFs of I and J.


(b) Assume that an answer to question j, if submitted by student i, is correct with
probability $p_{ij}$. Each answer gets a points if it is correct and gets b points
otherwise. Calculate the expected value of the score of student i. 


Problem 26. PMF of the minimum of several random variables. On a 
given day, your golf score takes values from the range 101 to 110, with probability 0.1,
independent of other days. Determined to improve your score, you decide to play on
three different days and declare as your score the minimum X of the scores $X_1$, $X_2$,
and $X_3$ on the different days.


(a) Calculate the PMF of X.

(b) By how much has your expected score improved as a result of playing on three
days?


Problem 27.* The multinomial distribution. A die with r faces, numbered 
1, ..., r, is rolled a fixed number of times n. The probability that the ith face comes up
on any one roll is denoted $p_i$, and the results of different rolls are assumed independent.
Let $X_i$ be the number of times that the ith face comes up.


(a) Find the joint PMF $p_{X_1, \ldots, X_r}(k_1, \ldots, k_r)$.
(b) Find the expected value and variance of $X_i$.
(c) Find $E[X_i X_j]$ for $i \ne j$.


Solution. (a) The probability of a sequence of rolls where, for i = 1, ..., r, face i comes
up $k_i$ times is $p_1^{k_1} \cdots p_r^{k_r}$. Every such sequence determines a partition of the set of n
rolls into r subsets with the ith subset having cardinality $k_i$ (this is the set of rolls
for which the ith face came up). The number of such partitions is the multinomial
coefficient (cf. Section 1.6)

$$\binom{n}{k_1, \ldots, k_r} = \frac{n!}{k_1! \cdots k_r!}.$$

Thus, if $k_1 + \cdots + k_r = n$,

$$p_{X_1, \ldots, X_r}(k_1, \ldots, k_r) = \binom{n}{k_1, \ldots, k_r}\, p_1^{k_1} \cdots p_r^{k_r},$$

and otherwise, $p_{X_1, \ldots, X_r}(k_1, \ldots, k_r) = 0$.


(b) The random variable $X_i$ is binomial with parameters n and $p_i$. Therefore, $E[X_i] = np_i$,
and $\mathrm{var}(X_i) = np_i(1 - p_i)$.


(c) Suppose that $i \ne j$, and let $Y_{i,k}$ (or $Y_{j,k}$) be the Bernoulli random variable that
takes the value 1 if face i (respectively, j) comes up on the kth roll, and the value 0
otherwise. Note that $Y_{i,k} Y_{j,k} = 0$, and that for $l \ne k$, $Y_{i,k}$ and $Y_{j,l}$ are independent, so
that $E[Y_{i,k} Y_{j,l}] = p_i p_j$. Therefore,

$$\begin{aligned}
E[X_i X_j] &= E[(Y_{i,1} + \cdots + Y_{i,n})(Y_{j,1} + \cdots + Y_{j,n})] \\
&= n(n-1)\, E[Y_{i,1} Y_{j,2}] \\
&= n(n-1)\, p_i p_j.
\end{aligned}$$
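
For small n and r, the result of part (c) can be confirmed by brute-force enumeration of all roll sequences. The Python sketch below is an illustration only; the function name `expected_product` and the sample probabilities are assumptions, and faces are indexed from 0 in the code.

```python
from itertools import product

def expected_product(n, probs, i, j):
    """E[X_i X_j] by enumerating all len(probs)**n sequences of n rolls."""
    total = 0.0
    for rolls in product(range(len(probs)), repeat=n):
        prob = 1.0
        for face in rolls:
            prob *= probs[face]       # rolls are independent
        total += prob * rolls.count(i) * rolls.count(j)
    return total

probs = [0.2, 0.3, 0.5]               # illustrative three-faced die
n = 4
print(expected_product(n, probs, 0, 1), n * (n - 1) * probs[0] * probs[1])
```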


Problem 28.* The quiz problem. Consider a quiz contest where a person is given 
a list of n questions and can answer these questions in any order he or she chooses. 
Question i will be answered correctly with probability $p_i$, and the person will then
receive a reward $v_i$. At the first incorrect answer, the quiz terminates and the person
is allowed to keep his or her previous rewards. The problem is to choose the ordering
of questions so as to maximize the expected value of the total reward obtained. Show
that it is optimal to answer questions in a nonincreasing order of $p_i v_i/(1 - p_i)$.


Solution. We will use a so-called interchange argument, which is often useful in
sequencing problems. Let i and j be the kth and (k + 1)st questions in an optimally
ordered list

$$L = (i_1, \ldots, i_{k-1}, i, j, i_{k+2}, \ldots, i_n).$$

Consider the list

$$L' = (i_1, \ldots, i_{k-1}, j, i, i_{k+2}, \ldots, i_n)$$

obtained from L by interchanging the order of questions i and j. We compute the
expected values of the rewards of L and L', and note that since L is optimally ordered,
we have

$$E[\text{reward of } L] \ge E[\text{reward of } L'].$$

Define the weight of question i to be

$$w(i) = \frac{p_i v_i}{1 - p_i}.$$

We will show that any permutation of the questions in a nonincreasing order of weights
maximizes the expected reward.

If $L = (i_1, \ldots, i_n)$ is a permutation of the questions, define $L^{(k)}$ to be the
permutation obtained from L by interchanging questions $i_k$ and $i_{k+1}$. Let us first compute
the difference between the expected reward of L and that of $L^{(k)}$. We have

$$E[\text{reward of } L] = p_{i_1} v_{i_1} + p_{i_1} p_{i_2} v_{i_2} + \cdots + p_{i_1} \cdots p_{i_n} v_{i_n},$$

and

$$\begin{aligned}
E[\text{reward of } L^{(k)}] ={}& p_{i_1} v_{i_1} + p_{i_1} p_{i_2} v_{i_2} + \cdots + p_{i_1} \cdots p_{i_{k-1}} v_{i_{k-1}} \\
&+ p_{i_1} \cdots p_{i_{k-1}} p_{i_{k+1}} v_{i_{k+1}} + p_{i_1} \cdots p_{i_{k-1}} p_{i_{k+1}} p_{i_k} v_{i_k} \\
&+ p_{i_1} \cdots p_{i_{k+2}} v_{i_{k+2}} + \cdots + p_{i_1} \cdots p_{i_n} v_{i_n}.
\end{aligned}$$

Therefore,

$$\begin{aligned}
E[\text{reward of } L^{(k)}] - E[\text{reward of } L] &= p_{i_1} \cdots p_{i_{k-1}} \left( p_{i_{k+1}} v_{i_{k+1}} + p_{i_{k+1}} p_{i_k} v_{i_k} - p_{i_k} v_{i_k} - p_{i_k} p_{i_{k+1}} v_{i_{k+1}} \right) \\
&= p_{i_1} \cdots p_{i_{k-1}} (1 - p_{i_k})(1 - p_{i_{k+1}}) \left( w(i_{k+1}) - w(i_k) \right).
\end{aligned}$$


Now, let us go back to our problem. Consider any permutation L of the questions. 
If $w(i_k) < w(i_{k+1})$ for some k, it follows from the above equation that the permutation
$L^{(k)}$ has an expected reward larger than that of L. So, an optimal permutation of the
questions must be in a nonincreasing order of weights.

Let us finally show that any two such permutations have equal expected rewards.
Assume that L is such a permutation and say that $w(i_k) = w(i_{k+1})$ for some k. We
know that interchanging $i_k$ and $i_{k+1}$ preserves the expected reward. So, the expected
reward of any permutation L' in a nonincreasing order of weights is equal to that of
L, because L' can be obtained from L by repeatedly interchanging adjacent questions
having equal weights. 
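
The optimality of the weight ordering can be checked on a small instance by comparing it with an exhaustive search over all orderings. The Python sketch below is illustrative: the function names and the sample values of p and v are assumptions, and every success probability is taken to be strictly less than 1 so that the weights are finite.

```python
from itertools import permutations

def expected_reward(order, p, v):
    """Expected total reward when questions are attempted in the given order."""
    total, prob_all_correct = 0.0, 1.0
    for q in order:
        prob_all_correct *= p[q]      # probability that questions so far are all correct
        total += prob_all_correct * v[q]
    return total

def weight_order(p, v):
    """Questions sorted in nonincreasing order of p_i * v_i / (1 - p_i)."""
    return sorted(range(len(p)), key=lambda i: p[i] * v[i] / (1 - p[i]), reverse=True)

# Illustrative data (assumed, not from the text).
p = [0.9, 0.5, 0.7, 0.3]
v = [1.0, 5.0, 2.0, 10.0]

best = max(expected_reward(order, p, v) for order in permutations(range(len(p))))
print(weight_order(p, v), expected_reward(weight_order(p, v), p, v), best)
```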


Problem 29.* The inclusion-exclusion formula. Let $A_1, A_2, \ldots, A_n$ be events.
Let $S_1 = \{i \mid 1 \le i \le n\}$, $S_2 = \{(i_1, i_2) \mid 1 \le i_1 < i_2 \le n\}$, and more generally, let $S_m$
be the set of all m-tuples $(i_1, \ldots, i_m)$ of indices that satisfy $1 \le i_1 < i_2 < \cdots < i_m \le n$.
Show that

$$\begin{aligned}
P\left(\cup_{k=1}^n A_k\right) ={}& \sum_{i \in S_1} P(A_i) - \sum_{(i_1, i_2) \in S_2} P(A_{i_1} \cap A_{i_2}) \\
&+ \sum_{(i_1, i_2, i_3) \in S_3} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{n-1} P\left(\cap_{k=1}^n A_k\right).
\end{aligned}$$

Hint: Let $X_i$ be a binary random variable which is equal to 1 when $A_i$ occurs, and
equal to 0 otherwise. Relate the event of interest to the random variable
$(1 - X_1)(1 - X_2) \cdots (1 - X_n)$.


Solution. Let us express the event $B = \cup_{k=1}^n A_k$ in terms of the random variables
$X_1, \ldots, X_n$. The event $B^c$ occurs when all of the random variables $X_1, \ldots, X_n$ are zero,
which happens when the random variable $Y = (1 - X_1)(1 - X_2) \cdots (1 - X_n)$ is equal to 1.
Note that Y can only take values in the set {0, 1}, so that $P(B^c) = P(Y = 1) = E[Y]$.
Therefore,

$$\begin{aligned}
P(B) &= 1 - E[(1 - X_1)(1 - X_2) \cdots (1 - X_n)] \\
&= E[X_1 + \cdots + X_n] - E\Big[\sum_{(i_1, i_2) \in S_2} X_{i_1} X_{i_2}\Big] + \cdots + (-1)^{n-1} E[X_1 \cdots X_n].
\end{aligned}$$

We note that

$$E[X_i] = P(A_i), \qquad E[X_{i_1} X_{i_2}] = P(A_{i_1} \cap A_{i_2}),$$
$$E[X_{i_1} X_{i_2} X_{i_3}] = P(A_{i_1} \cap A_{i_2} \cap A_{i_3}), \qquad E[X_1 X_2 \cdots X_n] = P\left(\cap_{k=1}^n A_k\right),$$

etc., from which the desired formula follows. 
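
On a finite sample space, the inclusion-exclusion formula can be evaluated mechanically and compared with the probability of the union computed directly. The Python sketch below is an illustration; the function name `union_probability`, the representation of events as sets of outcomes, and the die example are assumptions.

```python
from itertools import combinations

def union_probability(events, prob):
    """P(A_1 u ... u A_n) via inclusion-exclusion.

    events: list of sets of outcomes; prob: dict mapping outcome -> probability.
    """
    n = len(events)
    total = 0.0
    for m in range(1, n + 1):
        sign = (-1) ** (m - 1)
        # Sum over all index tuples i_1 < ... < i_m, i.e., over the set S_m.
        for idx in combinations(range(n), m):
            common = set.intersection(*(events[i] for i in idx))
            total += sign * sum(prob[w] for w in common)
    return total

# Illustrative example: one roll of a fair six-sided die.
prob = {w: 1 / 6 for w in range(1, 7)}
events = [{1, 2, 3}, {2, 4, 6}, {3, 6}]
direct = sum(prob[w] for w in set.union(*events))
print(union_probability(events, prob), direct)
```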


Problem 30.* Alvin’s database of friends contains n entries, but due to a software 
glitch, the addresses correspond to the names in a totally random fashion. Alvin writes 
a holiday card to each of his friends and sends it to the (software-corrupted) address. 
What is the probability that at least one of his friends will get the correct card? Hint: 
Use the inclusion-exclusion formula. 


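Solution. Let $A_k$ be the event that the kth card is sent to the correct address. We
have for any distinct k, j, i,

$$P(A_k) = \frac{1}{n} = \frac{(n-1)!}{n!},$$

$$P(A_k \cap A_j) = P(A_k)\, P(A_j \mid A_k) = \frac{1}{n} \cdot \frac{1}{n-1} = \frac{(n-2)!}{n!},$$

$$P(A_k \cap A_j \cap A_i) = \frac{1}{n} \cdot \frac{1}{n-1} \cdot \frac{1}{n-2} = \frac{(n-3)!}{n!},$$

etc., and

$$P\left(\cap_{k=1}^n A_k\right) = \frac{1}{n!}.$$

Applying the inclusion-exclusion formula,

$$\begin{aligned}
P\left(\cup_{k=1}^n A_k\right) ={}& \sum_{i \in S_1} P(A_i) - \sum_{(i_1, i_2) \in S_2} P(A_{i_1} \cap A_{i_2}) \\
&+ \sum_{(i_1, i_2, i_3) \in S_3} P(A_{i_1} \cap A_{i_2} \cap A_{i_3}) - \cdots + (-1)^{n-1} P\left(\cap_{k=1}^n A_k\right),
\end{aligned}$$

we obtain the desired probability

$$\begin{aligned}
P\left(\cup_{k=1}^n A_k\right) &= \binom{n}{1} \frac{(n-1)!}{n!} - \binom{n}{2} \frac{(n-2)!}{n!} + \binom{n}{3} \frac{(n-3)!}{n!} - \cdots + (-1)^{n-1} \frac{1}{n!} \\
&= 1 - \frac{1}{2!} + \frac{1}{3!} - \cdots + (-1)^{n-1} \frac{1}{n!}.
\end{aligned}$$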
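
For small n, this probability can be confirmed by listing all n! ways of assigning addresses to names. The Python sketch below is illustrative; the function names are hypothetical, and the brute-force check is practical only for small n.

```python
from itertools import permutations
from math import exp, factorial

def prob_at_least_one_match(n):
    """Alternating series 1 - 1/2! + 1/3! - ... + (-1)^(n-1)/n!."""
    return sum((-1) ** (m - 1) / factorial(m) for m in range(1, n + 1))

def brute_force(n):
    """Fraction of permutations of n items with at least one fixed point."""
    perms = list(permutations(range(n)))
    hits = sum(any(perm[i] == i for i in range(n)) for perm in perms)
    return hits / len(perms)

print(prob_at_least_one_match(5), brute_force(5))
print(prob_at_least_one_match(20), 1 - exp(-1))
```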











When n is large, this probability can be approximated by $1 - e^{-1}$.


SECTION 2.6. Conditioning


Problem 31. Consider four independent rolls of a 6-sided die. Let X be the number 
of 1s and let Y be the number of 2s obtained. What is the joint PMF of X and Y? 


Problem 32. D. Bernoulli's problem of joint lives. Consider 2m persons 
forming m couples who live together at a given time. Suppose that at some later time, 
the probability of each person being alive is p, independent of other persons. At that 
later time, let A be the number of persons that are alive and let S be the number of
couples in which both partners are alive. For any survivor number a, find E[S | A = a]. 


Problem 33.* A coin that has probability of heads equal to p is tossed successively 
and independently until a head comes twice in a row or a tail comes twice in a row. 
Find the expected value of the number of tosses. 


Solution. One possibility here is to calculate the PMF of X, the number of tosses 
until the game is over, and use it to compute E[X]. However, with an unfair coin, this 
turns out to be cumbersome, so we argue by using the total expectation theorem and
a suitable partition of the sample space. Let $H_k$ (or $T_k$) be the event that a head (or a
tail, respectively) comes at the kth toss, and let p (respectively, q) be the probability
of $H_k$ (respectively, $T_k$). Since $H_1$ and $T_1$ form a partition of the sample space, and
$P(H_1) = p$ and $P(T_1) = q$, we have

$$E[X] = p E[X \mid H_1] + q E[X \mid T_1].$$

Using again the total expectation theorem, we have

$$E[X \mid H_1] = p E[X \mid H_1 \cap H_2] + q E[X \mid H_1 \cap T_2] = 2p + q(1 + E[X \mid T_1]),$$

where we have used the fact

$$E[X \mid H_1 \cap H_2] = 2$$

(since the game ends after two successive heads), and

$$E[X \mid H_1 \cap T_2] = 1 + E[X \mid T_1]$$

(since if the game is not over, only the last toss matters in determining the number of
additional tosses up to termination). Similarly, we obtain

$$E[X \mid T_1] = 2q + p(1 + E[X \mid H_1]).$$

Combining the above two relations, collecting terms, and using the fact p + q = 1, we
obtain after some calculation

$$E[X \mid T_1] = \frac{2 + p^2}{1 - pq},$$

and similarly

$$E[X \mid H_1] = \frac{2 + q^2}{1 - pq}.$$

Thus,

$$E[X] = p \cdot \frac{2 + q^2}{1 - pq} + q \cdot \frac{2 + p^2}{1 - pq},$$

and finally, using the fact p + q = 1,

$$E[X] = \frac{2 + pq}{1 - pq}.$$

In the case of a fair coin (p = q = 1/2), we obtain E[X] = 3. It can also be verified
that $2 < E[X] \le 3$ for all values of p.
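
The algebra leading to the closed form can be double-checked numerically by solving the two total-expectation equations for a given p and comparing with $(2 + pq)/(1 - pq)$. The Python sketch below is illustrative; the function names are hypothetical and 0 < p < 1 is assumed.

```python
def expected_tosses(p):
    """E[X] obtained by solving the two conditional-expectation equations."""
    q = 1.0 - p
    # b = E[X | T1]: substitute a = 2p + q(1 + b) into b = 2q + p(1 + a) and solve for b.
    b = (2 * q + p * (1 + 2 * p + q)) / (1 - p * q)
    # a = E[X | H1]
    a = 2 * p + q * (1 + b)
    return p * a + q * b

def closed_form(p):
    """E[X] = (2 + pq) / (1 - pq)."""
    q = 1.0 - p
    return (2 + p * q) / (1 - p * q)

for p in (0.1, 0.3, 0.5, 0.8):
    print(p, expected_tosses(p), closed_form(p))
```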


Problem 34.* A spider and a fly move along a straight line. At each second, the fly 
moves a unit step to the right or to the left with equal probability p, and stays where 
it is with probability 1 — 2p. The spider always takes a unit step in the direction of the 
fly. The spider and the fly start D units apart, where D is a random variable taking 
positive integer values with a given PMF. If the spider lands on top of the fly, it's the 
end. What is the expected value of the time it takes for this to happen? 


Solution. Let T be the time at which the spider lands on top of the fly. We define

$A_d$: the event that initially the spider and the fly are d units apart.
$B_d$: the event that after one second the spider and the fly are d units apart.


Our approach will be to first apply the (conditional version of the) total expectation
theorem to compute $E[T \mid A_1]$, then use the result to compute $E[T \mid A_2]$, and similarly
compute sequentially $E[T \mid A_d]$ for all relevant values of d. We will then apply the
(unconditional version of the) total expectation theorem to compute E[T].

We have

$$A_d = (A_d \cap B_d) \cup (A_d \cap B_{d-1}) \cup (A_d \cap B_{d-2}), \qquad \text{if } d > 1.$$

This is because if the spider and the fly are at a distance d > 1 apart, then one second
later their distance will be d (if the fly moves away from the spider) or d - 1 (if the fly
does not move) or d - 2 (if the fly moves towards the spider). We also have, for the
case where the spider and the fly start one unit apart,

$$A_1 = (A_1 \cap B_1) \cup (A_1 \cap B_0).$$

Using the total expectation theorem, we obtain

$$\begin{aligned}
E[T \mid A_d] ={}& P(B_d \mid A_d)\, E[T \mid A_d \cap B_d] \\
&+ P(B_{d-1} \mid A_d)\, E[T \mid A_d \cap B_{d-1}] \\
&+ P(B_{d-2} \mid A_d)\, E[T \mid A_d \cap B_{d-2}], \qquad \text{if } d > 1,
\end{aligned}$$

and

$$E[T \mid A_1] = P(B_1 \mid A_1)\, E[T \mid A_1 \cap B_1] + P(B_0 \mid A_1)\, E[T \mid A_1 \cap B_0], \qquad \text{if } d = 1.$$
It can be seen based on the problem data that

$$P(B_1 \mid A_1) = 2p, \qquad P(B_0 \mid A_1) = 1 - 2p,$$

$$E[T \mid A_1 \cap B_1] = 1 + E[T \mid A_1], \qquad E[T \mid A_1 \cap B_0] = 1,$$

so by applying the formula for the case d = 1, we obtain

$$E[T \mid A_1] = 2p(1 + E[T \mid A_1]) + (1 - 2p),$$

or

$$E[T \mid A_1] = \frac{1}{1 - 2p}.$$


By applying the formula with d = 2, we obtain

$$E[T \mid A_2] = p E[T \mid A_2 \cap B_2] + (1 - 2p) E[T \mid A_2 \cap B_1] + p E[T \mid A_2 \cap B_0].$$

We have

$$E[T \mid A_2 \cap B_0] = 1,$$
$$E[T \mid A_2 \cap B_1] = 1 + E[T \mid A_1],$$
$$E[T \mid A_2 \cap B_2] = 1 + E[T \mid A_2],$$

so by substituting these relations in the expression for $E[T \mid A_2]$, we obtain

$$\begin{aligned}
E[T \mid A_2] &= p(1 + E[T \mid A_2]) + (1 - 2p)(1 + E[T \mid A_1]) + p \\
&= p(1 + E[T \mid A_2]) + (1 - 2p)\left(1 + \frac{1}{1 - 2p}\right) + p.
\end{aligned}$$

This equation yields after some calculation

$$E[T \mid A_2] = \frac{2}{1 - p}.$$


Generalizing, we obtain for d > 2,

$$E[T \mid A_d] = p(1 + E[T \mid A_d]) + (1 - 2p)(1 + E[T \mid A_{d-1}]) + p(1 + E[T \mid A_{d-2}]).$$

Thus, $E[T \mid A_d]$ can be generated recursively for any initial distance d, using as initial
conditions the values of $E[T \mid A_1]$ and $E[T \mid A_2]$ obtained earlier.

Finally, the expected value of T can be obtained using the given PMF for the
initial distance D and the total expectation theorem:

$$E[T] = \sum_d p_D(d)\, E[T \mid A_d].$$
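
The recursive generation of $E[T \mid A_d]$ can be sketched in Python as follows. Solving the recursion for $E[T \mid A_d]$ gives $(1 - p) E[T \mid A_d] = 1 + (1 - 2p) E[T \mid A_{d-1}] + p E[T \mid A_{d-2}]$, which is what the loop implements. The function names and the sample PMF of D are assumptions, and 0 <= p < 1/2 is assumed so that the initial conditions are finite.

```python
def conditional_expected_times(max_d, p):
    """Return a dict d -> E[T | A_d] for d = 1, ..., max(max_d, 2); assumes 0 <= p < 1/2."""
    e = {1: 1 / (1 - 2 * p), 2: 2 / (1 - p)}    # initial conditions from the text
    for d in range(3, max_d + 1):
        # Recursion for d > 2, solved for E[T | A_d].
        e[d] = (1 + (1 - 2 * p) * e[d - 1] + p * e[d - 2]) / (1 - p)
    return e

def expected_time(pmf_D, p):
    """E[T] = sum over d of p_D(d) * E[T | A_d] (total expectation theorem)."""
    e = conditional_expected_times(max(pmf_D), p)
    return sum(prob * e[d] for d, prob in pmf_D.items())

# Illustrative PMF for the initial distance D (assumed values).
pmf_D = {1: 0.2, 2: 0.3, 3: 0.5}
print(expected_time(pmf_D, 0.25))
```

When p = 0 the fly never moves, and the recursion returns $E[T \mid A_d] = d$, as one would expect.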


Problem 35.* Verify the expected value rule

$$E[g(X, Y)] = \sum_x \sum_y g(x, y)\, p_{X,Y}(x, y),$$

using the expected value rule for a function of a single random variable. Then, use the
rule for the special case of a linear function, to verify the formula

$$E[aX + bY] = aE[X] + bE[Y],$$

where a and b are given scalars.


Solution. We use the total expectation theorem to reduce the problem to the case of
a single random variable. In particular, we have

$$\begin{aligned}
E[g(X, Y)] &= \sum_y p_Y(y)\, E[g(X, Y) \mid Y = y] \\
&= \sum_y p_Y(y)\, E[g(X, y) \mid Y = y] \\
&= \sum_y p_Y(y) \sum_x g(x, y)\, p_{X|Y}(x \mid y) \\
&= \sum_x \sum_y g(x, y)\, p_{X,Y}(x, y),
\end{aligned}$$

as desired. Note that the third equality above used the expected value rule for the
function g(X, y) of a single random variable X.

For the linear special case, the expected value rule gives

$$\begin{aligned}
E[aX + bY] &= \sum_x \sum_y (ax + by)\, p_{X,Y}(x, y) \\
&= a \sum_x x \sum_y p_{X,Y}(x, y) + b \sum_y y \sum_x p_{X,Y}(x, y) \\
&= a \sum_x x\, p_X(x) + b \sum_y y\, p_Y(y) \\
&= aE[X] + bE[Y].
\end{aligned}$$


Problem 36.* The multiplication rule for conditional PMFs. Let X, Y, and 
Z be random variables. 


(a) Show that

$$p_{X,Y,Z}(x, y, z) = p_X(x)\, p_{Y|X}(y \mid x)\, p_{Z|X,Y}(z \mid x, y).$$

(b) How can we interpret this formula as a special case of the multiplication rule
given in Section 1.3?

(c) Generalize to the case of more than three random variables.


Solution. (a) We have

$$\begin{aligned}
p_{X,Y,Z}(x, y, z) &= P(X = x, Y = y, Z = z) \\
&= P(X = x)\, P(Y = y, Z = z \mid X = x) \\
&= P(X = x)\, P(Y = y \mid X = x)\, P(Z = z \mid X = x, Y = y) \\
&= p_X(x)\, p_{Y|X}(y \mid x)\, p_{Z|X,Y}(z \mid x, y).
\end{aligned}$$

(b) The formula can be written as

$$P(X = x, Y = y, Z = z) = P(X = x)\, P(Y = y \mid X = x)\, P(Z = z \mid X = x, Y = y),$$

which is a special case of the multiplication rule.


(c) The generalization is

$$p_{X_1, \ldots, X_n}(x_1, \ldots, x_n) = p_{X_1}(x_1)\, p_{X_2|X_1}(x_2 \mid x_1) \cdots p_{X_n|X_1, \ldots, X_{n-1}}(x_n \mid x_1, \ldots, x_{n-1}).$$


Problem 37.* Splitting a Poisson random variable. A transmitter sends out 
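either a 1 with probability p, or a 0 with probability 1 - p, independent of earlier
transmissions. If the number of transmissions within a given time interval has a Poisson
PMF with parameter λ, show that the number of 1s transmitted in that same time
interval has a Poisson PMF with parameter pλ.


Solution. Let X and Y be the numbers of 1s and 0s transmitted, respectively. Let
Z = X + Y be the total number of symbols transmitted. We have

$$\begin{aligned}
P(X = n, Y = m) &= P(X = n, Y = m \mid Z = n + m)\, P(Z = n + m) \\
&= \binom{n+m}{n} p^n (1-p)^m \cdot \frac{e^{-\lambda} \lambda^{n+m}}{(n+m)!} \\
&= \frac{e^{-\lambda p} (\lambda p)^n}{n!} \cdot \frac{e^{-\lambda(1-p)} \left(\lambda(1-p)\right)^m}{m!}.
\end{aligned}$$

Thus,

$$\begin{aligned}
P(X = n) &= \sum_{m=0}^{\infty} P(X = n, Y = m) \\
&= \frac{e^{-\lambda p} (\lambda p)^n}{n!}\, e^{-\lambda(1-p)} \sum_{m=0}^{\infty} \frac{\left(\lambda(1-p)\right)^m}{m!} \\
&= \frac{e^{-\lambda p} (\lambda p)^n}{n!}\, e^{-\lambda(1-p)} e^{\lambda(1-p)} \\
&= \frac{e^{-\lambda p} (\lambda p)^n}{n!},
\end{aligned}$$

so that X is Poisson with parameter λp.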
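
The splitting property can also be checked numerically by conditioning on the total number of transmissions Z and summing, with the infinite series truncated. The Python sketch below is illustrative; the function names, the truncation point `max_total`, and the sample values λ = 3 and p = 0.4 are assumptions.

```python
from math import comb, exp, factorial

def poisson_pmf(k, lam):
    """Poisson PMF with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

def split_pmf(n, lam, p, max_total=100):
    """P(X = n), conditioning on the total count Z = z (series truncated at max_total)."""
    return sum(
        poisson_pmf(z, lam) * comb(z, n) * p**n * (1 - p) ** (z - n)
        for z in range(n, max_total + 1)
    )

lam, p = 3.0, 0.4
for n in range(5):
    print(n, split_pmf(n, lam, p), poisson_pmf(n, lam * p))
```

The truncation at 100 terms is harmless for moderate λ because the Poisson tail is negligible there; a much larger λ would need a larger cutoff and a more careful evaluation of the PMF.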


SECTION 2.7. Independence 


Problem 38. Alice passes through four traffic lights on her way to work, and each 
light is equally likely to be green or red, independent of the others.


(a) What is the PMF, the mean, and the variance of the number of red lights that 
Alice encounters? 


(b) Suppose that each red light delays Alice by exactly two minutes. What is the
variance of Alice's commuting time?


Problem 39. Each morning, Hungry Harry eats some eggs. On any given morning,
the number of eggs he eats is equally likely to be 1, 2, 3, 4, 5, or 6, independent of
what he has done in the past. Let X be the number of eggs that Harry eats in 10 days.
Find the mean and variance of X.


Problem 40. A particular professor is known for his arbitrary grading policies. Each 
paper receives a grade from the set {A, A-, B+, B, B-, C+}, with equal probability,
independent of other papers. How many papers do you expect to hand in before you 
receive each possible grade at least once? 


Problem 41. You drive to work 5 days a week for a full year (50 weeks), and with 
probability p = 0.02 you get a traffic ticket on any given day, independent of other
days. Let X be the total number of tickets you get in the year. 


(a) What is the probability that the number of tickets you get is exactly equal to the 
expected value of X? 


(b) Calculate approximately the probability in (a) using a Poisson approximation. 


(c) Any one of the tickets is $10 or $20 or $50 with respective probabilities 0.5, 0.3, 
and 0.2, and independent of other tickets. Find the mean and the variance of the 
amount of money you pay in traffic tickets during the year. 


(d) Suppose you don't know the probability p of getting a ticket, but you got 5 tickets
during the year, and you estimate p by the sample mean

$$\hat{p} = \frac{5}{250} = 0.02.$$

What is the range of possible values of p assuming that the difference between
p and the sample mean $\hat{p}$ is within 5 times the standard deviation of the sample
mean?


Problem 42. Computational problem. Here is a probabilistic method for com- 
puting the area of a given subset S of the unit square. The method uses a sequence of 
independent random selections of points in the unit square [0, 1] x [0, 1], according to a 
uniform probability law. If the ith point belongs to the subset S, the value of a random
variable $X_i$ is set to 1, and otherwise it is set to 0. Let $X_1, X_2, \ldots$ be the sequence of
random variables thus defined, and for any n, let

$$S_n = \frac{X_1 + X_2 + \cdots + X_n}{n}.$$


(a) Show that $E[S_n]$ is equal to the area of the subset S, and that $\mathrm{var}(S_n)$ diminishes
to 0 as n increases.


(b) Show that to calculate $S_n$, it is sufficient to know $S_{n-1}$ and $X_n$, so the past
values of $X_k$, k = 1, ..., n - 1, do not need to be remembered. Give a formula.


(c) Write a computer program to generate $S_n$, for n = 1, 2, ..., 10000, using the
computer's random number generator, for the case where the subset S is the
circle inscribed within the unit square. How can you use your program to measure
experimentally the value of π?


(d) Use a similar computer program to calculate approximately the area of the set
of all (x, y) that lie within the unit square and satisfy $0 \le \cos \pi x + \sin \pi y \le 1$.


Problem 43.* Suppose that X and Y are independent, identically distributed,
geometric random variables with parameter p. Show that

$$P(X = i \mid X + Y = n) = \frac{1}{n-1}, \qquad i = 1, \ldots, n-1.$$

Solution. Consider repeatedly and independently tossing a coin with probability of 
heads p. We can interpret P(X = i| X + Y = n) as the probability that we obtained 
a head for the first time on the ith toss given that we obtained a head for the second 
time on the nth toss. We can then argue, intuitively, that given that the second head 
occurred on the nth toss, the first head is equally likely to have come up at any toss 
between 1 and n — 1. To establish this precisely, note that we have 


$$P(X = i \mid X + Y = n) = \frac{P(X = i,\, X + Y = n)}{P(X + Y = n)} = \frac{P(X = i)\,P(Y = n - i)}{P(X + Y = n)}.$$


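Also 
$$P(X = i) = p(1-p)^{i-1}, \qquad \text{for } i \ge 1,$$

and 
$$P(Y = n - i) = p(1-p)^{n-i-1}, \qquad \text{for } n - i \ge 1.$$

It follows that 

$$P(X = i)\,P(Y = n - i) = \begin{cases} p^2(1-p)^{n-2}, & \text{if } i = 1, \ldots, n-1, \\ 0, & \text{otherwise.} \end{cases}$$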


Therefore, for any $i$ and $j$ in the range $[1, n-1]$, we have 
$$P(X = i \mid X + Y = n) = P(X = j \mid X + Y = n).$$

Hence 

$$P(X = i \mid X + Y = n) = \frac{1}{n-1}, \qquad i = 1, \ldots, n-1.$$
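
The result can also be checked numerically. The following Python sketch computes the conditional PMF exactly with rational arithmetic; the helper names `geometric_pmf` and `conditional_pmf`, and the sample values $n = 6$, $p = 1/3$, are illustrative choices and not part of the problem.

```python
from fractions import Fraction

def geometric_pmf(k, p):
    """P(X = k) = p (1 - p)^(k - 1) for k >= 1, and 0 otherwise."""
    return p * (1 - p) ** (k - 1) if k >= 1 else Fraction(0)

def conditional_pmf(n, p):
    """Return [P(X = i | X + Y = n) for i = 1, ..., n - 1]."""
    # Joint probabilities P(X = i) P(Y = n - i), using independence.
    joint = [geometric_pmf(i, p) * geometric_pmf(n - i, p) for i in range(1, n)]
    total = sum(joint)                 # this is P(X + Y = n)
    return [w / total for w in joint]

cond = conditional_pmf(6, Fraction(1, 3))
print(cond)   # every entry equals 1/(n - 1) = 1/5
```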


Problem 44.* Let X and Y be two random variables with given joint PMF, and 
let g and h be two functions of X and Y, respectively. Show that if X and Y are 
independent, then the same is true for the random variables g(X) and h(Y). 


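Solution. Let $U = g(X)$ and $V = h(Y)$. Then, we have 

$$\begin{aligned}
p_{U,V}(u, v) &= \sum_{\{(x,y) \,\mid\, g(x)=u,\ h(y)=v\}} p_{X,Y}(x, y) \\
&= \sum_{\{(x,y) \,\mid\, g(x)=u,\ h(y)=v\}} p_X(x)\,p_Y(y) \\
&= \sum_{\{x \,\mid\, g(x)=u\}} p_X(x) \sum_{\{y \,\mid\, h(y)=v\}} p_Y(y) \\
&= p_U(u)\,p_V(v),
\end{aligned}$$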


so U and V are independent. 


Problem 45.* Variability extremes. Let $X_1, \ldots, X_n$ be independent random vari- 
ables and let $X = X_1 + \cdots + X_n$ be their sum. 


(a) Suppose that each $X_i$ is Bernoulli with parameter $p_i$, and that $p_1, \ldots, p_n$ are 
chosen so that the mean of $X$ is a given $\mu > 0$. Show that the variance of $X$ is 
maximized if the $p_i$ are chosen to be all equal to $\mu/n$. 


(b) Suppose that each $X_i$ is geometric with parameter $p_i$, and that $p_1, \ldots, p_n$ are 
chosen so that the mean of $X$ is a given $\mu > 0$. Show that the variance of $X$ 
is minimized if the $p_i$ are chosen to be all equal to $n/\mu$. [Note the strikingly 
different character of the results of parts (a) and (b).] 


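Solution. (a) We have 

$$\mathrm{var}(X) = \sum_{i=1}^{n} \mathrm{var}(X_i) = \sum_{i=1}^{n} p_i(1 - p_i) = \mu - \sum_{i=1}^{n} p_i^2.$$

Thus maximizing the variance is equivalent to minimizing $\sum_{i=1}^{n} p_i^2$. It can be seen 
(using the constraint $\sum_{i=1}^{n} p_i = \mu$) that 

$$\sum_{i=1}^{n} p_i^2 = \sum_{i=1}^{n} (\mu/n)^2 + \sum_{i=1}^{n} (p_i - \mu/n)^2,$$

so $\sum_{i=1}^{n} p_i^2$ is minimized when $p_i = \mu/n$ for all $i$. 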


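(b) We have 

$$\mu = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \frac{1}{p_i},$$

and 

$$\mathrm{var}(X) = \sum_{i=1}^{n} \mathrm{var}(X_i) = \sum_{i=1}^{n} \frac{1 - p_i}{p_i^2}.$$

Introducing the change of variables $y_i = 1/p_i = E[X_i]$, we see that the constraint 
becomes 

$$\sum_{i=1}^{n} y_i = \mu,$$

and that we must minimize 

$$\sum_{i=1}^{n} y_i(y_i - 1) = \sum_{i=1}^{n} y_i^2 - \mu,$$

subject to that constraint. This is the same problem as the one of part (a), so the 
method of proof given there applies. 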


Problem 46.* Entropy and uncertainty. Consider a random variable $X$ that can 
take $n$ values, $x_1, \ldots, x_n$, with corresponding probabilities $p_1, \ldots, p_n$. The entropy of 
$X$ is defined to be 

$$H(X) = -\sum_{i=1}^{n} p_i \log p_i.$$

(All logarithms in this problem are with respect to base two.) The entropy $H(X)$ 
provides a measure of the uncertainty about the value of $X$. To get a sense of this, note 
that $H(X) \ge 0$ and that $H(X)$ is very close to 0 when $X$ is “nearly deterministic,” 
i.e., takes one of its possible values with probability very close to 1 (since we have 
$p \log p \approx 0$ if either $p \approx 0$ or $p \approx 1$). 

The notion of entropy is fundamental in information theory, which originated 
with C. Shannon's famous work and is described in many specialized textbooks. For 
example, it can be shown that $H(X)$ is a lower bound to the average number of yes-no 
questions (such as “is $X = x_1$?” or “is $X < x_5$?”) that must be asked in order to deter- 
mine the value of $X$. Furthermore, if $k$ is the average number of questions required to 
determine the value of a string of independent identically distributed random variables 
$X_1, X_2, \ldots, X_n$, then, with a suitable strategy, $k/n$ can be made as close to $H(X)$ as 
desired, when $n$ is large. 


(a) Show that if $q_1, \ldots, q_n$ are nonnegative numbers such that $\sum_{i=1}^{n} q_i = 1$, then 

$$H(X) \le -\sum_{i=1}^{n} p_i \log q_i,$$

with equality if and only if $p_i = q_i$ for all $i$. As a special case, show that 
$H(X) \le \log n$, with equality if and only if $p_i = 1/n$ for all $i$. Hint: Use the 
inequality $\ln \alpha \le \alpha - 1$, for $\alpha > 0$, which holds with equality if and only if $\alpha = 1$; 
here $\ln \alpha$ stands for the natural logarithm. 


(b) Let $X$ and $Y$ be random variables taking a finite number of values, and having 
joint PMF $p_{X,Y}(x, y)$. Define 

$$I(X, Y) = \sum_{x} \sum_{y} p_{X,Y}(x, y) \log\left(\frac{p_{X,Y}(x, y)}{p_X(x)\,p_Y(y)}\right).$$

Show that $I(X, Y) \ge 0$, and that $I(X, Y) = 0$ if and only if $X$ and $Y$ are 
independent. 


(c) Show that 
$$I(X, Y) = H(X) + H(Y) - H(X, Y),$$

where 
$$H(X, Y) = -\sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_{X,Y}(x, y),$$

$$H(X) = -\sum_{x} p_X(x) \log p_X(x), \qquad H(Y) = -\sum_{y} p_Y(y) \log p_Y(y).$$


(d) Show that 
$$I(X, Y) = H(X) - H(X \mid Y),$$

where 

$$H(X \mid Y) = -\sum_{y} p_Y(y) \sum_{x} p_{X|Y}(x \mid y) \log p_{X|Y}(x \mid y).$$

[Note that $H(X \mid Y)$ may be viewed as the conditional entropy of $X$ given $Y$, that 
is, the entropy of the conditional distribution of $X$, given that $Y = y$, averaged 
over all possible values $y$. Thus, the quantity $I(X, Y) = H(X) - H(X \mid Y)$ is the 
reduction in the entropy (uncertainty) on $X$, when $Y$ becomes known. It can 
therefore be interpreted as the information about $X$ that is conveyed by $Y$, and is 
called the mutual information of $X$ and $Y$.] 


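Solution. (a) We will use the inequality $\ln \alpha \le \alpha - 1$. (To see why this inequality is true, 
write $\ln \alpha = \int_1^{\alpha} \beta^{-1}\, d\beta < \int_1^{\alpha} d\beta = \alpha - 1$ for $\alpha > 1$, and write 
$\ln \alpha = -\int_{\alpha}^{1} \beta^{-1}\, d\beta < -\int_{\alpha}^{1} d\beta = \alpha - 1$ for $0 < \alpha < 1$.) 
We have 

$$-\sum_{i=1}^{n} p_i \ln p_i + \sum_{i=1}^{n} p_i \ln q_i = \sum_{i=1}^{n} p_i \ln\left(\frac{q_i}{p_i}\right) \le \sum_{i=1}^{n} p_i \left(\frac{q_i}{p_i} - 1\right) = 0,$$

with equality if and only if $p_i = q_i$ for all $i$. Since $\ln p = \log p \,\ln 2$, we obtain the desired 
relation $H(X) \le -\sum_{i=1}^{n} p_i \log q_i$. The inequality $H(X) \le \log n$ is obtained by setting 
$q_i = 1/n$ for all $i$. 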


(b) The numbers $p_X(x)\,p_Y(y)$ satisfy $\sum_{x} \sum_{y} p_X(x)\,p_Y(y) = 1$, so by part (a), we have 

$$\sum_{x} \sum_{y} p_{X,Y}(x, y) \log\big(p_{X,Y}(x, y)\big) \ge \sum_{x} \sum_{y} p_{X,Y}(x, y) \log\big(p_X(x)\,p_Y(y)\big),$$

with equality if and only if 
$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y), \qquad \text{for all } x \text{ and } y,$$

which is equivalent to $X$ and $Y$ being independent. 
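(c) We have 

$$I(X, Y) = \sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) - \sum_{x} \sum_{y} p_{X,Y}(x, y) \log\big(p_X(x)\,p_Y(y)\big),$$

and 

$$\sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) = -H(X, Y),$$

$$\begin{aligned}
-\sum_{x} \sum_{y} p_{X,Y}(x, y) \log\big(p_X(x)\,p_Y(y)\big) &= -\sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_X(x) - \sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_Y(y) \\
&= -\sum_{x} p_X(x) \log p_X(x) - \sum_{y} p_Y(y) \log p_Y(y) \\
&= H(X) + H(Y).
\end{aligned}$$

Combining the above three relations, we obtain $I(X, Y) = H(X) + H(Y) - H(X, Y)$. 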
(d) From the calculation in part (c), we have 

$$\begin{aligned}
I(X, Y) &= \sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_{X,Y}(x, y) - \sum_{x} p_X(x) \log p_X(x) - \sum_{x} \sum_{y} p_{X,Y}(x, y) \log p_Y(y) \\
&= H(X) + \sum_{x} \sum_{y} p_{X,Y}(x, y) \log\left(\frac{p_{X,Y}(x, y)}{p_Y(y)}\right) \\
&= H(X) + \sum_{x} \sum_{y} p_Y(y)\, p_{X|Y}(x \mid y) \log p_{X|Y}(x \mid y) \\
&= H(X) - H(X \mid Y).
\end{aligned}$$
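
As a numerical illustration of parts (b) and (c), the sketch below evaluates the entropy and the mutual information for a small joint PMF. The function names and the 2-by-2 joint PMF are hypothetical, chosen here only to exercise the formulas; all logarithms are base two, as in the problem.

```python
import math

def entropy(pmf):
    """H = -sum p log2 p, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in pmf if p > 0)

def marginals(joint):
    """joint[i][j] = P(X = x_i, Y = y_j); return the PMFs of X and of Y."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    return p_x, p_y

def mutual_information(joint):
    """I(X,Y) = sum over (x,y) of p(x,y) log2( p(x,y) / (p_X(x) p_Y(y)) )."""
    p_x, p_y = marginals(joint)
    total = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:
                total += p * math.log2(p / (p_x[i] * p_y[j]))
    return total

# Hypothetical joint PMF of two dependent binary random variables.
joint = [[0.25, 0.25],
         [0.00, 0.50]]
p_x, p_y = marginals(joint)
h_xy = entropy([p for row in joint for p in row])
info = mutual_information(joint)
print(info, entropy(p_x) + entropy(p_y) - h_xy)   # the two values agree
```

Replacing `joint` by a product of its marginals should drive the mutual information to zero, in line with part (b).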


Random variables with a continuous range of possible values are quite common; 
the velocity of a vehicle traveling along the highway could be one example. If the 
velocity is measured by a digital speedometer, we may view the speedometer's 
reading as a discrete random variable. But if we wish to model the exact velocity, 
a continuous random variable is called for. Models involving continuous random 
variables can be useful for several reasons. Besides being finer-grained and pos- 
sibly more accurate, they allow the use of powerful tools from calculus and often 
admit an insightful analysis that would not be possible under a discrete model. 

All of the concepts and methods introduced in Chapter 2, such as expec- 
tation, PMFs, and conditioning, have continuous counterparts. Developing and 
interpreting these counterparts is the subject of this chapter. 


3.1 CONTINUOUS RANDOM VARIABLES AND PDFS 


A random variable X is called continuous if there is a nonnegative function fx, 
called the probability density function of X, or PDF for short, such that 


$$P(X \in B) = \int_B f_X(x)\, dx,$$


for every subset B of the real line. In particular, the probability that the value 
of X falls within an interval is 


$$P(a \le X \le b) = \int_a^b f_X(x)\, dx,$$


and can be interpreted as the area under the graph of the PDF (see Fig. 3.1). 
For any single value $a$, we have $P(X = a) = \int_a^a f_X(x)\, dx = 0$. For this reason, 
including or excluding the endpoints of an interval has no effect on its probability: 


$$P(a \le X \le b) = P(a < X \le b) = P(a \le X < b) = P(a < X < b).$$


Note that to qualify as a PDF, a function fx must be nonnegative, i.e., 
$f_X(x) \ge 0$ for every $x$, and must also have the normalization property 

$$\int_{-\infty}^{\infty} f_X(x)\, dx = P(-\infty < X < \infty) = 1.$$


† The integral $\int_B f_X(x)\, dx$ is to be interpreted in the usual calculus/Riemann 
sense and we implicitly assume that it is well-defined. For highly unusual functions 
and sets, this integral can be harder - or even impossible - to define, but such issues 
belong to a more advanced treatment of the subject. In any case, it is comforting 
to know that mathematical subtleties of this type do not arise if fx is a piecewise 
continuous function with a finite or countable number of points of discontinuity, and 
B is the union of a finite or countable number of intervals. 

Figure 3.1: Illustration of a PDF (labels in the figure: sample space; event $\{a \le X \le b\}$). 
The probability that $X$ takes a value in an interval $[a, b]$ is $\int_a^b f_X(x)\, dx$, which is the 
shaded area in the figure. 


Graphically, this means that the entire area under the graph of the PDF must 
be equal to 1. 

To interpret the PDF, note that for an interval $[x, x+\delta]$ with very small 
length $\delta$, we have 

$$P([x, x+\delta]) = \int_x^{x+\delta} f_X(t)\, dt \approx f_X(x) \cdot \delta,$$

so we can view $f_X(x)$ as the “probability mass per unit length” near $x$ (cf. 
Fig. 3.2). It is important to realize that even though a PDF is used to calculate 
event probabilities, $f_X(x)$ is not the probability of any particular event. In 
particular, it is not restricted to be less than or equal to one. 


Figure 3.2: Interpretation of the PDF $f_X(x)$ as “probability mass per unit length” 
around $x$. If $\delta$ is very small, the probability that $X$ takes a value in the interval 
$[x, x+\delta]$ is the shaded area in the figure, which is approximately equal to $f_X(x) \cdot \delta$. 


Example 3.1. Continuous Uniform Random Variable. A gambler spins a 
wheel of fortune, continuously calibrated between 0 and 1, and observes the resulting 
number. Assuming that any two subintervals of $[0, 1]$ of the same length have the 
same probability, this experiment can be modeled in terms of a random variable $X$ 
with PDF 

$$f_X(x) = \begin{cases} c, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise,} \end{cases}$$

for some constant $c$. This constant can be determined by using the normalization 
property 

$$1 = \int_{-\infty}^{\infty} f_X(x)\, dx = \int_0^1 c\, dx = c \int_0^1 dx = c,$$

so that $c = 1$. 

Figure 3.3: The PDF of a uniform random variable. 

More generally, we can consider a random variable X that takes values in 
an interval [a,b], and again assume that any two subintervals of the same length 
have the same probability. We refer to this type of random variable as uniform or 
uniformly distributed. Its PDF has the form 





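$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise,} \end{cases}$$

(cf. Fig. 3.3). The constant value of the PDF within $[a, b]$ is determined from the 
normalization property. Indeed, we have 

$$1 = \int_{-\infty}^{\infty} f_X(x)\, dx = \int_a^b \frac{1}{b-a}\, dx.$$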

Example 3.2. Piecewise Constant PDF. Alvin's driving time to work is be- 
tween 15 and 20 minutes if the day is sunny, and between 20 and 25 minutes if the 
day is rainy, with all times being equally likely in each case. Assume that a day 
is sunny with probability 2/3 and rainy with probability 1/3. What is the PDF of 
the driving time, viewed as a random variable X? 

We interpret the statement that "all times are equally likely" in the sunny 
and the rainy cases, to mean that the PDF of X is constant in each of the intervals 
[15, 20] and [20,25]. Furthermore, since these two intervals contain all possible 
driving times, the PDF should be zero everywhere else: 


$$f_X(x) = \begin{cases} c_1, & \text{if } 15 \le x < 20, \\ c_2, & \text{if } 20 \le x \le 25, \\ 0, & \text{otherwise,} \end{cases}$$

where $c_1$ and $c_2$ are some constants. We can determine these constants by using 
the given probabilities of a sunny and of a rainy day: 

$$\frac{2}{3} = P(\text{sunny day}) = \int_{15}^{20} f_X(x)\, dx = \int_{15}^{20} c_1\, dx = 5c_1,$$

$$\frac{1}{3} = P(\text{rainy day}) = \int_{20}^{25} f_X(x)\, dx = \int_{20}^{25} c_2\, dx = 5c_2,$$

so that 
$$c_1 = \frac{2}{15}, \qquad c_2 = \frac{1}{15}.$$

Figure 3.4: A piecewise constant PDF involving three intervals. 


Generalizing this example, consider a random variable $X$ whose PDF has the 
piecewise constant form 

$$f_X(x) = \begin{cases} c_i, & \text{if } a_i \le x < a_{i+1}, \quad i = 1, 2, \ldots, n-1, \\ 0, & \text{otherwise,} \end{cases}$$

where $a_1, a_2, \ldots, a_n$ are some scalars with $a_i < a_{i+1}$ for all $i$, and $c_1, c_2, \ldots, c_n$ are 
some nonnegative constants (cf. Fig. 3.4). The constants $c_i$ may be determined 
by additional problem data, as in the preceding driving context. Generally, the $c_i$ 
must be such that the normalization property holds: 

$$1 = \int_{a_1}^{a_n} f_X(x)\, dx = \sum_{i=1}^{n-1} \int_{a_i}^{a_{i+1}} c_i\, dx = \sum_{i=1}^{n-1} c_i (a_{i+1} - a_i).$$
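
The normalization condition for a piecewise constant PDF is easy to check mechanically. The short Python sketch below does so for the driving-time PDF of Example 3.2; the function name `total_probability` and the list-based representation of the breakpoints $a_i$ and heights $c_i$ are assumptions made for this illustration.

```python
from fractions import Fraction

def total_probability(breakpoints, heights):
    """Sum of c_i * (a_{i+1} - a_i) over consecutive intervals.

    breakpoints: a_1 < a_2 < ... < a_n
    heights:     c_1, ..., c_{n-1}, one constant per interval
    """
    return sum(c * (b - a)
               for c, a, b in zip(heights, breakpoints, breakpoints[1:]))

# Driving-time PDF of Example 3.2: c_1 = 2/15 on [15, 20), c_2 = 1/15 on [20, 25].
breakpoints = [15, 20, 25]
heights = [Fraction(2, 15), Fraction(1, 15)]

print(total_probability(breakpoints, heights))          # total mass, should be 1
print(total_probability([15, 20], [Fraction(2, 15)]))   # P(sunny day) = 2/3
```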


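Example 3.3. A PDF Can Take Arbitrarily Large Values. Consider a 
random variable $X$ with PDF 

$$f_X(x) = \begin{cases} \dfrac{1}{2\sqrt{x}}, & \text{if } 0 < x \le 1, \\ 0, & \text{otherwise.} \end{cases}$$

Even though $f_X(x)$ becomes infinitely large as $x$ approaches zero, this is still a valid 
PDF, because 

$$\int_{-\infty}^{\infty} f_X(x)\, dx = \int_0^1 \frac{1}{2\sqrt{x}}\, dx = \sqrt{x}\,\Big|_0^1 = 1.$$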


Summary of PDF Properties 


Let $X$ be a continuous random variable with PDF $f_X$. 

• $f_X(x) \ge 0$ for all $x$. 

• $\int_{-\infty}^{\infty} f_X(x)\, dx = 1$. 

• If $\delta$ is very small, then $P([x, x+\delta]) \approx f_X(x) \cdot \delta$. 

• For any subset $B$ of the real line, 

$$P(X \in B) = \int_B f_X(x)\, dx.$$


Expectation 


The expected value or expectation or mean of a continuous random variable 
$X$ is defined by† 

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx.$$

This is similar to the discrete case except that the PMF is replaced by the PDF, 
and summation is replaced by integration. As in Chapter 2, $E[X]$ can be inter- 
preted as the “center of gravity” of the PDF and, also, as the anticipated average 
value of $X$ in a large number of independent repetitions of the experiment. Its 
mathematical properties are similar to the discrete case; after all, an integral 
is just a limiting form of a sum. 

If $X$ is a continuous random variable with given PDF, any real-valued 
function $Y = g(X)$ of $X$ is also a random variable. Note that $Y$ can be a 
continuous random variable: for example, consider the trivial case where 
$Y = g(X) = X$. But $Y$ can also turn out to be discrete. For example, suppose that 


† One has to deal with the possibility that the integral $\int_{-\infty}^{\infty} x f_X(x)\, dx$ is infi- 
nite or undefined. More concretely, we will say that the expectation is well-defined if 
$\int_{-\infty}^{\infty} |x| f_X(x)\, dx < \infty$. In that case, it is known that the integral $\int_{-\infty}^{\infty} x f_X(x)\, dx$ takes 
a finite and unambiguous value. 

For an example where the expectation is not well-defined, consider a random 
variable $X$ with PDF $f_X(x) = c/(1 + x^2)$, where $c$ is a constant chosen to enforce the 
normalization condition. The expression $|x| f_X(x)$ can be approximated by $c/|x|$ when 
$|x|$ is large. Using the fact $\int_1^{\infty} (1/x)\, dx = \infty$, one can show that $\int_{-\infty}^{\infty} |x| f_X(x)\, dx = \infty$. 
Thus, $E[X]$ is left undefined, despite the symmetry of the PDF around zero. 

Throughout this book, in the absence of an indication to the contrary, we implic- 
itly assume that the expected value of any random variable of interest is well-defined. 
$g(x) = 1$ for $x > 0$, and $g(x) = 0$, otherwise. Then $Y = g(X)$ is a discrete 
random variable taking values in the finite set $\{0, 1\}$. In either case, the mean 
of $g(X)$ satisfies the expected value rule 

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx,$$

in complete analogy with the discrete case; see the end-of-chapter problems. 
The $n$th moment of a continuous random variable $X$ is defined as $E[X^n]$, 
the expected value of the random variable $X^n$. The variance, denoted by 
$\mathrm{var}(X)$, is defined as the expected value of the random variable $(X - E[X])^2$. 
We now summarize this discussion and list a number of additional facts 
that are practically identical to their discrete counterparts. 


Expectation of a Continuous Random Variable and its Properties 
Let $X$ be a continuous random variable with PDF $f_X$. 

• The expectation of $X$ is defined by 
$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx.$$

• The expected value rule for a function $g(X)$ has the form 
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\, dx.$$

• The variance of $X$ is defined by 
$$\mathrm{var}(X) = E\big[(X - E[X])^2\big] = \int_{-\infty}^{\infty} (x - E[X])^2 f_X(x)\, dx.$$

• We have 
$$0 \le \mathrm{var}(X) = E[X^2] - (E[X])^2.$$

• If $Y = aX + b$, where $a$ and $b$ are given scalars, then 
$$E[Y] = aE[X] + b, \qquad \mathrm{var}(Y) = a^2 \mathrm{var}(X).$$








Example 3.4. Mean and Variance of the Uniform Random Variable. 
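Consider a uniform PDF over an interval $[a, b]$, as in Example 3.1. We have 

$$E[X] = \int_{-\infty}^{\infty} x f_X(x)\, dx = \int_a^b x \cdot \frac{1}{b-a}\, dx = \frac{1}{b-a} \cdot \frac{1}{2} x^2 \Big|_a^b = \frac{1}{b-a} \cdot \frac{b^2 - a^2}{2} = \frac{a+b}{2},$$

as one expects based on the symmetry of the PDF around $(a+b)/2$. 
To obtain the variance, we first calculate the second moment. We have 

$$E[X^2] = \int_a^b \frac{x^2}{b-a}\, dx = \frac{1}{b-a} \int_a^b x^2\, dx = \frac{1}{b-a} \cdot \frac{1}{3} x^3 \Big|_a^b = \frac{b^3 - a^3}{3(b-a)} = \frac{a^2 + ab + b^2}{3}.$$

Thus, the variance is obtained as 

$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = \frac{a^2 + ab + b^2}{3} - \frac{(a+b)^2}{4} = \frac{(b-a)^2}{12},$$

after some calculation. 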
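
The closed-form expressions can be cross-checked numerically. The following Python sketch approximates the two integrals with a midpoint rule; the function name `uniform_moments`, the number of subintervals, and the sample endpoints $a = 2$, $b = 5$ are illustrative assumptions.

```python
def uniform_moments(a, b, steps=100000):
    """Approximate E[X] and var(X) for a uniform PDF on [a, b]
    by midpoint-rule integration of x f(x) and x^2 f(x)."""
    width = (b - a) / steps
    density = 1.0 / (b - a)            # constant value of the PDF on [a, b]
    mean = 0.0
    second_moment = 0.0
    for k in range(steps):
        x = a + (k + 0.5) * width      # midpoint of the k-th subinterval
        mean += x * density * width
        second_moment += x * x * density * width
    return mean, second_moment - mean ** 2   # var = E[X^2] - (E[X])^2

a, b = 2.0, 5.0
mean, variance = uniform_moments(a, b)
print(mean, (a + b) / 2)               # both close to 3.5
print(variance, (b - a) ** 2 / 12)     # both close to 0.75
```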


Exponential Random Variable 


An exponential random variable has a PDF of the form 


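$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases}$$

where $\lambda$ is a positive parameter characterizing the PDF (see Fig. 3.5). This is a 
legitimate PDF because 

$$\int_{-\infty}^{\infty} f_X(x)\, dx = \int_0^{\infty} \lambda e^{-\lambda x}\, dx = -e^{-\lambda x} \Big|_0^{\infty} = 1.$$

Note that the probability that $X$ exceeds a certain value decreases exponentially. 
Indeed, for any $a \ge 0$, we have 

$$P(X \ge a) = \int_a^{\infty} \lambda e^{-\lambda x}\, dx = -e^{-\lambda x} \Big|_a^{\infty} = e^{-\lambda a}.$$
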


An exponential random variable can, for example, be a good model for 
the amount of time until an incident of interest takes place, such as a message 
arriving at a computer, some equipment breaking down, a light bulb burning 
out, an accident occurring, etc. We will see that it is closely connected to the 
geometric random variable, which also relates to the (discrete) time that will 
elapse until an incident of interest takes place. The exponential random variable 
will also play a major role in our study of stochastic processes in Chapter 6, but 
for the time being we will simply view it as a special random variable that is 
fairly tractable analytically. 

Figure 3.5: The PDF $\lambda e^{-\lambda x}$ of an exponential random variable, shown for a small 
and for a large value of $\lambda$. 

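The mean and the variance can be calculated to be 

$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$

These formulas can be verified by straightforward calculation, as we now show. 
We have, using integration by parts, 

$$\begin{aligned}
E[X] &= \int_0^{\infty} x \lambda e^{-\lambda x}\, dx \\
&= \big(-x e^{-\lambda x}\big) \Big|_0^{\infty} + \int_0^{\infty} e^{-\lambda x}\, dx \\
&= 0 - \frac{e^{-\lambda x}}{\lambda} \Big|_0^{\infty} \\
&= \frac{1}{\lambda}.
\end{aligned}$$

Using again integration by parts, the second moment is 

$$\begin{aligned}
E[X^2] &= \int_0^{\infty} x^2 \lambda e^{-\lambda x}\, dx \\
&= \big(-x^2 e^{-\lambda x}\big) \Big|_0^{\infty} + \int_0^{\infty} 2x e^{-\lambda x}\, dx \\
&= 0 + \frac{2}{\lambda} E[X] \\
&= \frac{2}{\lambda^2}.
\end{aligned}$$

Finally, using the formula $\mathrm{var}(X) = E[X^2] - (E[X])^2$, we obtain 

$$\mathrm{var}(X) = \frac{2}{\lambda^2} - \frac{1}{\lambda^2} = \frac{1}{\lambda^2}.$$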


Example 3.5. The time until a small meteorite first lands anywhere in the Sahara 
desert is modeled as an exponential random variable with a mean of 10 days. The 
time is currently midnight. What is the probability that a meteorite first lands 
some time between 6 a.m. and 6 p.m. of the first day? 

Let $X$ be the time elapsed until the event of interest, measured in days. 
Then, $X$ is exponential, with mean $1/\lambda = 10$, which yields $\lambda = 1/10$. The desired 
probability is 

$$P(1/4 \le X \le 3/4) = P(X \ge 1/4) - P(X > 3/4) = e^{-1/40} - e^{-3/40} = 0.0476,$$

where we have used the formula $P(X \ge a) = P(X > a) = e^{-\lambda a}$. 
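
The arithmetic in this example can be reproduced in a few lines of Python. The function name `exponential_tail` is a hypothetical helper introduced here; it simply evaluates the tail formula $P(X \ge a) = e^{-\lambda a}$ for $a \ge 0$.

```python
import math

def exponential_tail(a, lam):
    """P(X >= a) for an exponential random variable with parameter lam."""
    return math.exp(-lam * a) if a >= 0 else 1.0

lam = 1.0 / 10.0                       # mean of 10 days, so lambda = 1/10
# 6 a.m. is 1/4 of a day after midnight, and 6 p.m. is 3/4 of a day.
prob = exponential_tail(0.25, lam) - exponential_tail(0.75, lam)
print(round(prob, 4))                  # about 0.0476
```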


3.2 CUMULATIVE DISTRIBUTION FUNCTIONS 


We have been dealing with discrete and continuous random variables in a some- 
what different manner, using PMFs and PDFs, respectively. It would be desir- 
able to describe all kinds of random variables with a single mathematical concept. 
This is accomplished with the cumulative distribution function, or CDF for 
short. The CDF of a random variable $X$ is denoted by $F_X$ and provides the 
probability $P(X \le x)$. In particular, for every $x$ we have 

$$F_X(x) = P(X \le x) = \begin{cases} \displaystyle\sum_{k \le x} p_X(k), & \text{if } X \text{ is discrete,} \\[2ex] \displaystyle\int_{-\infty}^{x} f_X(t)\, dt, & \text{if } X \text{ is continuous.} \end{cases}$$

Loosely speaking, the CDF $F_X(x)$ “accumulates” probability “up to” the value $x$. 

Any random variable associated with a given probability model has a CDF, 
regardless of whether it is discrete or continuous. This is because $\{X \le x\}$ is 
always an event and therefore has a well-defined probability. In what follows, any 
unambiguous specification of the probabilities of all events of the form $\{X \le x\}$, 
be it through a PMF, PDF, or CDF, will be referred to as the probability law 
of the random variable $X$. 

Figures 3.6 and 3.7 illustrate the CDFs of various discrete and continuous 
random variables. From these figures, as well as from the definition, some general 
properties of the CDF can be observed. 

Figure 3.6: CDFs of some discrete random variables. The CDF is related to the 
PMF through the formula 

$$F_X(x) = P(X \le x) = \sum_{k \le x} p_X(k),$$

and has a staircase form, with jumps occurring at the values of positive probability 
mass. Note that at the points where a jump occurs, the value of $F_X$ is the larger 
of the two corresponding values (i.e., $F_X$ is continuous from the right). 


Figure 3.7: CDFs of some continuous random variables. The CDF is related to 
the PDF through the formula 

$$F_X(x) = P(X \le x) = \int_{-\infty}^{x} f_X(t)\, dt.$$

Thus, the PDF $f_X$ can be obtained from the CDF by differentiation: 

$$f_X(x) = \frac{dF_X}{dx}(x).$$

For a continuous random variable, the CDF has no jumps, i.e., it is continuous. 


Properties of a CDF 
The CDF $F_X$ of a random variable $X$ is defined by 

$$F_X(x) = P(X \le x), \qquad \text{for all } x,$$

and has the following properties. 

• $F_X$ is monotonically nondecreasing: 
if $x \le y$, then $F_X(x) \le F_X(y)$. 

• $F_X(x)$ tends to 0 as $x \to -\infty$, and to 1 as $x \to \infty$. 

• If $X$ is discrete, then $F_X(x)$ is a piecewise constant function of $x$. 

• If $X$ is continuous, then $F_X(x)$ is a continuous function of $x$. 

• If $X$ is discrete and takes integer values, the PMF and the CDF can 
be obtained from each other by summing or differencing: 

$$F_X(k) = \sum_{i=-\infty}^{k} p_X(i),$$

$$p_X(k) = P(X \le k) - P(X \le k-1) = F_X(k) - F_X(k-1),$$

for all integers $k$. 

• If $X$ is continuous, the PDF and the CDF can be obtained from each 
other by integration or differentiation: 

$$F_X(x) = \int_{-\infty}^{x} f_X(t)\, dt, \qquad f_X(x) = \frac{dF_X}{dx}(x).$$

(The second equality is valid for those $x$ at which the PDF is contin- 
uous.) 





Sometimes, in order to calculate the PMF or PDF of a discrete or contin- 
uous random variable, respectively, it is more convenient to first calculate the 
CDF. The systematic use of this approach for functions of continuous random 
variables will be discussed in Section 4.1. The following is a discrete example. 


Example 3.6. The Maximum of Several Random Variables. You are al- 
lowed to take a certain test three times, and your final score will be the maximum 
of the test scores. Thus, 
$$X = \max\{X_1, X_2, X_3\},$$

where $X_1, X_2, X_3$ are the three test scores and $X$ is the final score. Assume that 
your score in each test takes one of the values from 1 to 10 with equal probability 
1/10, independently of the scores in other tests. What is the PMF $p_X$ of the final 
score? 

We calculate the PMF indirectly. We first compute the CDF $F_X$ and then 
obtain the PMF as 

$$p_X(k) = F_X(k) - F_X(k-1), \qquad k = 1, \ldots, 10.$$

We have 
$$\begin{aligned}
F_X(k) &= P(X \le k) \\
&= P(X_1 \le k,\ X_2 \le k,\ X_3 \le k) \\
&= P(X_1 \le k)\, P(X_2 \le k)\, P(X_3 \le k) \\
&= \left(\frac{k}{10}\right)^3,
\end{aligned}$$

where the third equality follows from the independence of the events $\{X_1 \le k\}$, 
$\{X_2 \le k\}$, $\{X_3 \le k\}$. Thus, the PMF is given by 

$$p_X(k) = \left(\frac{k}{10}\right)^3 - \left(\frac{k-1}{10}\right)^3, \qquad k = 1, \ldots, 10.$$

The preceding line of argument can be generalized to any number of random 
variables $X_1, \ldots, X_n$. In particular, if the events $\{X_1 \le x\}, \ldots, \{X_n \le x\}$ are 
independent for every $x$, then the CDF of $X = \max\{X_1, \ldots, X_n\}$ is 

$$F_X(x) = F_{X_1}(x) \cdots F_{X_n}(x).$$

From this formula, we can obtain $p_X(x)$ by differencing (if $X$ is discrete), or $f_X(x)$ 
by differentiation (if $X$ is continuous). 
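
The CDF-then-difference calculation of Example 3.6 can be written out as a short Python sketch. The function name `max_score_pmf` and its parameters are illustrative assumptions; exact rational arithmetic is used so that the result can be compared directly with the formula above.

```python
from fractions import Fraction

def max_score_pmf(num_tests=3, top=10):
    """PMF of the maximum of num_tests independent scores, each uniform
    on {1, ..., top}, obtained by differencing the CDF (k/top)^num_tests."""
    def cdf(k):
        return Fraction(k, top) ** num_tests
    return {k: cdf(k) - cdf(k - 1) for k in range(1, top + 1)}

pmf = max_score_pmf()
print(pmf[10])               # (10/10)^3 - (9/10)^3 = 271/1000
print(sum(pmf.values()))     # the probabilities add up to 1
```

A brute-force count over all 1000 equally likely score triples gives the same probabilities, which is a convenient sanity check on the differencing step.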


The Geometric and Exponential CDFs 


Because the CDF is defined for any type of random variable, it provides a conve- 
nient means for exploring the relations between continuous and discrete random 
variables. A particularly interesting case in point is the relation between geo- 
metric and exponential random variables. 

Let X be a geometric random variable with parameter p; that is, X is the 
number of trials until the first success in a sequence of independent Bernoulli 
trials, where the probability of success at each trial is p. Thus, for k = 1,2,..., 
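we have $P(X = k) = p(1-p)^{k-1}$ and the CDF is given by 

$$F_{\text{geo}}(n) = \sum_{k=1}^{n} p(1-p)^{k-1} = p\, \frac{1 - (1-p)^n}{1 - (1-p)} = 1 - (1-p)^n, \qquad \text{for } n = 1, 2, \ldots.$$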


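Suppose now that $X$ is an exponential random variable with parameter 
$\lambda > 0$. Its CDF is given by 

$$F_{\text{exp}}(x) = P(X \le x) = 0, \qquad \text{for } x \le 0,$$

and 
$$F_{\text{exp}}(x) = \int_0^x \lambda e^{-\lambda t}\, dt = -e^{-\lambda t} \Big|_0^x = 1 - e^{-\lambda x}, \qquad \text{for } x > 0.$$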
0 


Figure 3.8: Relation of the geometric and the exponential CDFs (curves shown: the 
exponential CDF $1 - e^{-\lambda x}$, and the geometric CDF $1 - (1-p)^n$ with $p = 1 - e^{-\lambda\delta}$). 
We have 

$$F_{\text{exp}}(n\delta) = F_{\text{geo}}(n), \qquad n = 1, 2, \ldots,$$

where $\delta$ is chosen so that $e^{-\lambda\delta} = 1 - p$. As $\delta$ approaches 0, the exponential random 
variable can be interpreted as a “limit” of the geometric. 


To compare the two CDFs above, let us define $\delta = -\ln(1-p)/\lambda$, so that 
$$e^{-\lambda\delta} = 1 - p.$$

Then, we see that the values of the exponential and the geometric CDFs are 
equal whenever $x = n\delta$, with $n = 1, 2, \ldots$, i.e., 

$$F_{\text{exp}}(n\delta) = F_{\text{geo}}(n), \qquad n = 1, 2, \ldots,$$

and are close to each other for other values of $x$ (see Fig. 3.8). Suppose now 
that we toss very quickly (every $\delta$ seconds, where $\delta \ll 1$) a biased coin with 
a very small probability of heads (equal to $p = 1 - e^{-\lambda\delta}$). Then, the first 
time to obtain a head (a geometric random variable with parameter $p$) is a close 
approximation to an exponential random variable with parameter $\lambda$, in the sense 
that the corresponding CDFs are very close to each other, as shown in Fig. 3.8. 
This relation between the geometric and the exponential random variables will 
play an important role when we study the Bernoulli and Poisson processes in 
Chapter 6. 
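
The agreement of the two CDFs at the points $x = n\delta$ is easy to confirm numerically. In the Python sketch below, the function names and the sample values $\lambda = 2$ and $p = 0.1$ are arbitrary illustrative choices.

```python
import math

def geometric_cdf(n, p):
    """F_geo(n) = 1 - (1 - p)^n for n = 1, 2, ..."""
    return 1.0 - (1.0 - p) ** n

def exponential_cdf(x, lam):
    """F_exp(x) = 1 - exp(-lam x) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * x) if x > 0 else 0.0

lam, p = 2.0, 0.1
delta = -math.log(1.0 - p) / lam       # chosen so that exp(-lam * delta) = 1 - p

# Largest discrepancy between the two CDFs over the grid points n * delta.
max_gap = max(abs(exponential_cdf(n * delta, lam) - geometric_cdf(n, p))
              for n in range(1, 21))
print(max_gap)                         # zero up to floating-point rounding
```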


3.3 NORMAL RANDOM VARIABLES 


A continuous random variable $X$ is said to be normal or Gaussian if it has a 
PDF of the form (see Fig. 3.9) 

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2},$$

where $\mu$ and $\sigma$ are two scalar parameters characterizing the PDF, with $\sigma$ assumed 
positive. It can be verified that the normalization property 

$$\frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\, dx = 1$$

holds (see the end-of-chapter problems). 


Figure 3.9: A normal PDF and CDF (panels: normal PDF $f_X(x)$; normal CDF $F_X(x)$), 
with $\mu = 1$ and $\sigma^2 = 1$. We observe that 
the PDF is symmetric around its mean $\mu$, and has a characteristic bell shape. 
As $x$ gets further from $\mu$, the term $e^{-(x-\mu)^2/2\sigma^2}$ decreases very rapidly. In this 
figure, the PDF is very close to zero outside the interval $[-1, 3]$. 


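The mean and the variance can be calculated to be 
$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2.$$
To see this, note that the PDF is symmetric around $\mu$, so the mean can only be 
$\mu$. Furthermore, the variance is given by 

$$\mathrm{var}(X) = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} (x - \mu)^2 e^{-(x-\mu)^2/2\sigma^2}\, dx.$$

Using the change of variables $y = (x - \mu)/\sigma$ and integration by parts, we have 

$$\begin{aligned}
\mathrm{var}(X) &= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} y^2 e^{-y^2/2}\, dy \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \big(-y e^{-y^2/2}\big) \Big|_{-\infty}^{\infty} + \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy \\
&= \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy \\
&= \sigma^2.
\end{aligned}$$

The last equality above is obtained by using the fact 
$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2}\, dy = 1,$$
which is just the normalization property of the normal PDF for the case where 
$\mu = 0$ and $\sigma = 1$. 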


A normal random variable has several special properties. The following 
one is particularly important and will be justified in Section 4.1. 


Normality is Preserved by Linear Transformations 


If $X$ is a normal random variable with mean $\mu$ and variance $\sigma^2$, and if $a \ne 0$, 
$b$ are scalars, then the random variable 

$$Y = aX + b$$

is also normal, with mean and variance 

$$E[Y] = a\mu + b, \qquad \mathrm{var}(Y) = a^2\sigma^2.$$





The Standard Normal Random Variable 


A normal random variable $Y$ with zero mean and unit variance is said to be a 
standard normal. Its CDF is denoted by $\Phi$: 

$$\Phi(y) = P(Y \le y) = P(Y < y) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{y} e^{-t^2/2}\, dt.$$

It is recorded in a table (given on the next page), and is a very useful tool 
for calculating various probabilities involving normal random variables; see also 
Fig. 3.10. 

Note that the table only provides the values of $\Phi(y)$ for $y \ge 0$, because the 
omitted values can be found using the symmetry of the PDF. For example, if $Y$ 
is a standard normal random variable, we have 

$$\begin{aligned}
\Phi(-0.5) &= P(Y \le -0.5) = P(Y \ge 0.5) = 1 - P(Y < 0.5) \\
&= 1 - \Phi(0.5) = 1 - 0.6915 = 0.3085.
\end{aligned}$$

More generally, we have 

$$\Phi(-y) = 1 - \Phi(y), \qquad \text{for all } y.$$
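
When a table is not at hand, $\Phi$ can be evaluated with the error function, using the standard identity $\Phi(y) = \frac{1}{2}\big(1 + \mathrm{erf}(y/\sqrt{2})\big)$. The Python sketch below relies on `math.erf` from the standard library; the function name `phi` is an illustrative choice and is not notation used in the text.

```python
import math

def phi(y):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

print(round(phi(0.5), 4))     # compare with the table entry .6915
print(round(phi(-0.5), 4))    # compare with 1 - .6915 = .3085
print(round(phi(1.71), 4))    # compare with the table entry .9564
```

The symmetry relation $\Phi(-y) = 1 - \Phi(y)$ holds for the computed values as well, up to floating-point rounding.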

     |  .00    .01    .02    .03    .04    .05    .06    .07    .08    .09 
-----+--------------------------------------------------------------------- 
 0.0 | .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359 
 0.1 | .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753 
 0.2 | .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141 
 0.3 | .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517 
 0.4 | .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879 
 0.5 | .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224 
 0.6 | .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549 
 0.7 | .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852 
 0.8 | .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133 
 0.9 | .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389 
 1.0 | .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621 
 1.1 | .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830 
 1.2 | .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015 
 1.3 | .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177 
 1.4 | .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319 
 1.5 | .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9429  .9441 
 1.6 | .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545 
 1.7 | .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633 
 1.8 | .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9699  .9706 
 1.9 | .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9761  .9767 
 2.0 | .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817 
 2.1 | .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857 
 2.2 | .9861  .9864  .9868  .9871  .9875  .9878  .9881  .9884  .9887  .9890 
 2.3 | .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916 
 2.4 | .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936 
 2.5 | .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952 
 2.6 | .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964 
 2.7 | .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974 
 2.8 | .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981 
 2.9 | .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986 
 3.0 | .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990 
 3.1 | .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993 
 3.2 | .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995 
 3.3 | .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997 
 3.4 | .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998 


The standard normal table. The entries in this table provide the numerical values 
of $\Phi(y) = P(Y \le y)$, where $Y$ is a standard normal random variable, for $y$ between 0 
and 3.49. For example, to find $\Phi(1.71)$, we look at the row corresponding to 1.7 and 
the column corresponding to 0.01, so that $\Phi(1.71) = .9564$. When $y$ is negative, the 
value of $\Phi(y)$ can be found using the formula $\Phi(y) = 1 - \Phi(-y)$. 

Let $X$ be a normal random variable with mean $\mu$ and variance $\sigma^2$. We 
“standardize” $X$ by defining a new random variable $Y$ given by 

$$Y = \frac{X - \mu}{\sigma}.$$

Since $Y$ is a linear function of $X$, it is normal. Furthermore, 

$$E[Y] = \frac{E[X] - \mu}{\sigma} = 0, \qquad \mathrm{var}(Y) = \frac{\mathrm{var}(X)}{\sigma^2} = 1.$$



Thus, Y is a standard normal random variable. This fact allows us to calculate 
the probability of any event defined in terms of X: we redefine the event in terms 
of Y, and then use the standard normal table. 

Figure 3.10: The PDF 
$$f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}$$
of the standard normal random variable. The corresponding CDF, which is de- 
noted by $\Phi$, is recorded in a table. 


Example 3.7. Using the Normal Table. The annual snowfall at a particular
geographic location is modeled as a normal random variable with a mean of $\mu = 60$
inches and a standard deviation of $\sigma = 20$. What is the probability that this year's
snowfall will be at least 80 inches?

Let $X$ be the snow accumulation, viewed as a normal random variable, and
let

$$Y = \frac{X - \mu}{\sigma} = \frac{X - 60}{20}$$

be the corresponding standard normal random variable. We have

$$P(X \ge 80) = P\left(\frac{X - 60}{20} \ge \frac{80 - 60}{20}\right) = P\left(Y \ge \frac{80 - 60}{20}\right) = P(Y \ge 1) = 1 - \Phi(1),$$

where $\Phi$ is the CDF of the standard normal. We read the value $\Phi(1)$ from the table:

$$\Phi(1) = 0.8413,$$

so that

$$P(X \ge 80) = 1 - \Phi(1) = 0.1587.$$


Generalizing the approach in the preceding example, we have the following 
procedure. 


CDF Calculation for a Normal Random Variable 


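For a normal random variable $X$ with mean $\mu$ and variance $\sigma^2$, we use a
two-step procedure.

(a) "Standardize" $X$, i.e., subtract $\mu$ and divide by $\sigma$ to obtain a standard
normal random variable $Y$.

(b) Read the CDF value from the standard normal table:

$$P(X \le x) = P\left(\frac{X - \mu}{\sigma} \le \frac{x - \mu}{\sigma}\right) = P\left(Y \le \frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{x - \mu}{\sigma}\right).$$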
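The two-step procedure can also be carried out numerically. The following is a minimal Python sketch, not part of the original text: the function names `std_normal_cdf` and `normal_cdf` are our own, and the table lookup is replaced (an assumption of this sketch) by the identity $\Phi(y) = \tfrac{1}{2}\big(1 + \mathrm{erf}(y/\sqrt{2})\big)$, evaluated with the standard library function `math.erf`.

```python
import math

def std_normal_cdf(y: float) -> float:
    """Phi(y) for a standard normal random variable, via the error function."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def normal_cdf(x: float, mu: float, sigma: float) -> float:
    """P(X <= x) for X normal with mean mu and standard deviation sigma."""
    # Step (a): standardize. Step (b): evaluate Phi instead of reading the table.
    return std_normal_cdf((x - mu) / sigma)

# Example 3.7: snowfall with mean 60 and standard deviation 20.
p_at_least_80 = 1.0 - normal_cdf(80.0, 60.0, 20.0)
print(round(p_at_least_80, 4))  # 0.1587
```

The computed values agree with the table entries to the four digits that the table provides.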





Normal random variables are often used in signal processing and communi- 
cations engineering to model noise and unpredictable distortions of signals. The 
following is a typical example. 


Example 3.8. Signal Detection. A binary message is transmitted as a signal
$s$, which is either $-1$ or $+1$. The communication channel corrupts the transmission
with additive normal noise with mean $\mu = 0$ and variance $\sigma^2$. The receiver concludes
that the signal $-1$ (or $+1$) was transmitted if the value received is $< 0$ (or $\ge 0$,
respectively); see Fig. 3.11. What is the probability of error?

An error occurs whenever $-1$ is transmitted and the noise $N$ is at least 1, so
that $s + N = -1 + N \ge 0$, or whenever $+1$ is transmitted and the noise $N$ is smaller
than $-1$, so that $s + N = 1 + N < 0$. In the former case, the probability of error is

$$P(N \ge 1) = 1 - P(N < 1) = 1 - P\left(\frac{N - \mu}{\sigma} < \frac{1 - \mu}{\sigma}\right) = 1 - \Phi\left(\frac{1}{\sigma}\right).$$

In the latter case, the probability of error is the same, by symmetry. The value
of $\Phi(1/\sigma)$ can be obtained from the normal table. For $\sigma = 1$, we have
$\Phi(1/\sigma) = \Phi(1) = 0.8413$, and the probability of error is 0.1587.
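As an illustration, the error probability of Example 3.8 can be compared with a simulated transmission. This is a sketch under our own assumptions: the names `error_probability` and `simulate_error_rate` are hypothetical, the transmitted symbols are drawn as equally likely (the text does not specify their probabilities, and the error probability is the same in both cases), and $\Phi$ is evaluated with `math.erf` rather than the table.

```python
import math
import random

def std_normal_cdf(y):
    """Phi(y) for a standard normal random variable."""
    return 0.5 * (1.0 + math.erf(y / math.sqrt(2.0)))

def error_probability(sigma):
    """Exact error probability 1 - Phi(1/sigma) from Example 3.8."""
    return 1.0 - std_normal_cdf(1.0 / sigma)

def simulate_error_rate(sigma, n_trials=100_000, seed=0):
    """Fraction of simulated transmissions that are decoded incorrectly."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(n_trials):
        s = rng.choice((-1, 1))               # transmitted signal (assumed equally likely)
        received = s + rng.gauss(0.0, sigma)  # additive zero-mean normal noise
        decoded = 1 if received >= 0 else -1  # receiver's decision rule
        errors += decoded != s
    return errors / n_trials

print(error_probability(1.0), simulate_error_rate(1.0))
```

For $\sigma = 1$ the exact value is about 0.1587, and the simulated frequency should be close to it; smaller values of $\sigma$ give a smaller error probability.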


Figure 3.11: The signal detection scheme of Example 3.8. The area of the
shaded region gives the probability of error in the two cases where $-1$ and $+1$
is transmitted.

Normal random variables play an important role in a broad range of
probabilistic models. The main reason is that, generally speaking, they model well the
additive effect of many independent factors in a variety of engineering, physical,
and statistical contexts. Mathematically, the key fact is that the sum of a large
number of independent and identically distributed (not necessarily normal)
random variables has an approximately normal CDF, regardless of the CDF of the
individual random variables. This property is captured in the celebrated central
limit theorem, which will be discussed in Chapter 5.


3.4 JOINT PDFS OF MULTIPLE RANDOM VARIABLES 


We will now extend the notion of a PDF to the case of multiple random variables. 
In complete analogy with discrete random variables, we introduce joint, marginal, 
and later on, conditional PDFs. Their intuitive interpretation as well as their 
main properties parallel the discrete case. 

We say that two continuous random variables associated with the same
experiment are jointly continuous and can be described in terms of a joint
PDF $f_{X,Y}$ if $f_{X,Y}$ is a nonnegative function that satisfies

$$P\big((X,Y) \in B\big) = \iint_{(x,y) \in B} f_{X,Y}(x,y)\,dx\,dy,$$

for every subset $B$ of the two-dimensional plane. The notation above means
that the integration is carried over the set $B$. In the particular case where $B$ is
a rectangle of the form $B = \{(x,y) \mid a \le x \le b,\ c \le y \le d\}$, we have

$$P(a \le X \le b,\ c \le Y \le d) = \int_c^d \int_a^b f_{X,Y}(x,y)\,dx\,dy.$$

Furthermore, by letting $B$ be the entire two-dimensional plane, we obtain the
normalization property

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = 1.$$


To interpret the joint PDF, we let $\delta$ be a small positive number and consider
the probability of a small rectangle. We have

$$P(a \le X \le a + \delta,\ c \le Y \le c + \delta) = \int_c^{c+\delta} \int_a^{a+\delta} f_{X,Y}(x,y)\,dx\,dy \approx f_{X,Y}(a,c) \cdot \delta^2,$$

so we can view $f_{X,Y}(a,c)$ as the "probability per unit area" in the vicinity of
$(a,c)$.

The joint PDF contains all relevant probabilistic information on the random
variables $X$, $Y$, and their dependencies. It allows us to calculate the probability
of any event that can be defined in terms of these two random variables. As
a special case, it can be used to calculate the probability of an event involving
only one of them. For example, let $A$ be a subset of the real line and consider
the event $\{X \in A\}$. We have

$$P(X \in A) = P\big(X \in A \text{ and } Y \in (-\infty, \infty)\big) = \int_A \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy\,dx.$$

Comparing with the formula

$$P(X \in A) = \int_A f_X(x)\,dx,$$

we see that the marginal PDF $f_X$ of $X$ is given by

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy.$$

Similarly,

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$


Example 3.9. Two-Dimensional Uniform PDF. Romeo and Juliet have a
date at a given time, and each will arrive at the meeting place with a delay between
0 and 1 hour (recall the example given in Section 1.2). Let $X$ and $Y$ denote the
delays of Romeo and Juliet, respectively. Assuming that no pairs $(x,y)$ in the unit
square are more likely than others, a natural model involves a joint PDF of the
form

$$f_{X,Y}(x,y) = \begin{cases} c, & \text{if } 0 \le x \le 1 \text{ and } 0 \le y \le 1, \\ 0, & \text{otherwise,} \end{cases}$$

where $c$ is a constant. For this PDF to satisfy the normalization property

$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx\,dy = \int_0^1 \int_0^1 c\,dx\,dy = 1,$$

we must have

$$c = 1.$$

This is an example of a uniform joint PDF. More generally, let us fix some
subset $S$ of the two-dimensional plane. The corresponding uniform joint PDF on $S$
is defined to be

$$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\text{area of } S}, & \text{if } (x,y) \in S, \\ 0, & \text{otherwise.} \end{cases}$$

For any set $A \subset S$, the probability that $(X,Y)$ lies in $A$ is

$$P\big((X,Y) \in A\big) = \iint_{(x,y) \in A} f_{X,Y}(x,y)\,dx\,dy = \frac{1}{\text{area of } S} \iint_{(x,y) \in A} dx\,dy = \frac{\text{area of } A}{\text{area of } S}.$$
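The area-ratio formula can be checked by simulation. The sketch below is our own illustration: the function names are hypothetical, and the event $A$ (the two delays differ by at most a quarter of an hour) is chosen by us for concreteness. For that event the area of $A$ inside the unit square is $1 - (3/4)^2 = 7/16$.

```python
import random

def estimate_uniform_probability(in_a, n_samples=200_000, seed=0):
    """Estimate P((X, Y) in A) for (X, Y) uniform on the unit square."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if in_a(rng.random(), rng.random()))
    return hits / n_samples

def within_quarter_hour(x, y):
    """Illustrative event A: the two delays differ by at most 1/4 hour."""
    return abs(x - y) <= 0.25

estimate = estimate_uniform_probability(within_quarter_hour)
exact = 1.0 - 0.75 ** 2  # area of the band |x - y| <= 1/4 inside the unit square
print(estimate, exact)
```

Since the unit square has area 1, the exact probability equals the area of $A$, and the simulated frequency should be close to it.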


Example 3.10. We are told that the joint PDF of the random variables $X$ and
$Y$ is a constant $c$ on the set $S$ shown in Fig. 3.12 and is zero outside. We wish to
determine the value of $c$ and the marginal PDFs of $X$ and $Y$.

The area of the set $S$ is equal to 4 and, therefore, $f_{X,Y}(x,y) = c = 1/4$, for
$(x,y) \in S$. To find the marginal PDF $f_X(x)$ for some particular $x$, we integrate
(with respect to $y$) the joint PDF over the vertical line corresponding to that $x$.
The resulting PDF is shown in the figure. We can compute $f_Y$ similarly.

Figure 3.12: The joint PDF in Example 3.10 and the resulting marginal PDFs.

Example 3.11. Buffon's Needle.¹ This is a famous example, which marks the
origin of the subject of geometric probability, that is, the analysis of the geometric
configuration of randomly placed objects.

A surface is ruled with parallel lines, which are at distance $d$ from each other
(see Fig. 3.13). Suppose that we throw a needle of length $l$ on the surface at random.
What is the probability that the needle will intersect one of the lines?

















Figure 3.13: Buffon's needle. The length of the line segment between the
midpoint of the needle and the point of intersection of the axis of the needle
with the closest parallel line is $x/\sin\theta$. The needle will intersect the closest
parallel line if and only if this length is less than $l/2$.





We assume here that $l < d$, so that the needle cannot intersect two lines
simultaneously. Let $X$ be the vertical distance from the midpoint of the needle to
the nearest of the parallel lines, and let $\Theta$ be the acute angle formed by the axis
of the needle and the parallel lines (see Fig. 3.13). We model the pair of random
variables $(X, \Theta)$ with a uniform joint PDF over the rectangular set
$\{(x,\theta) \mid 0 \le x \le d/2,\ 0 \le \theta \le \pi/2\}$, so that

$$f_{X,\Theta}(x,\theta) = \begin{cases} 4/(\pi d), & \text{if } x \in [0, d/2] \text{ and } \theta \in [0, \pi/2], \\ 0, & \text{otherwise.} \end{cases}$$

As can be seen from Fig. 3.13, the needle will intersect one of the lines if and
only if

$$X \le \frac{l}{2} \sin\Theta,$$

so the probability of intersection is

$$\begin{aligned}
P\big(X \le (l/2)\sin\Theta\big) &= \iint_{x \le (l/2)\sin\theta} f_{X,\Theta}(x,\theta)\,dx\,d\theta \\
&= \frac{4}{\pi d} \int_0^{\pi/2} \int_0^{(l/2)\sin\theta} dx\,d\theta \\
&= \frac{4}{\pi d} \int_0^{\pi/2} \frac{l}{2} \sin\theta\,d\theta \\
&= \frac{2l}{\pi d} (-\cos\theta) \Big|_0^{\pi/2} \\
&= \frac{2l}{\pi d}.
\end{aligned}$$

The probability of intersection can be empirically estimated, by repeating the
experiment a large number of times. Since it is equal to $2l/(\pi d)$, this provides us with
a method for the experimental evaluation of $\pi$.

¹ This problem was posed and solved in 1777 by the French naturalist Buffon.
A number of variants of the problem have been investigated, including the case where
the surface is ruled with two sets of perpendicular lines (Laplace, 1812); see the
end-of-chapter problems. The problem has long fascinated scientists, and has been used
as a basis for experimental evaluations of $\pi$ (among others, it has been reported that
a captain named Fox measured $\pi$ experimentally using needles, while recovering from
wounds suffered in the American Civil War). The internet contains several graphical
simulation programs for computing $\pi$ using Buffon's ideas.
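The experimental evaluation described in Example 3.11 can be imitated on a computer. The following is a rough sketch with hypothetical function names, not a program from the text. It samples $(X, \Theta)$ from the uniform joint PDF of the example and counts intersections. Note one caveat of the sketch: it uses `math.pi` to sample the angle, so it demonstrates the formula $2l/(\pi d)$ rather than computing $\pi$ from scratch.

```python
import math
import random

def buffon_intersection_frequency(l, d, n_throws=200_000, seed=0):
    """Fraction of simulated throws in which the needle crosses a line."""
    if not 0 < l < d:
        raise ValueError("this sketch assumes 0 < l < d, as in Example 3.11")
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_throws):
        x = rng.uniform(0.0, d / 2.0)            # midpoint distance to nearest line
        theta = rng.uniform(0.0, math.pi / 2.0)  # acute angle with the lines
        if x <= (l / 2.0) * math.sin(theta):     # intersection condition
            hits += 1
    return hits / n_throws

def estimate_pi(l, d, **kwargs):
    """Solve frequency = 2l / (pi d) for pi."""
    frequency = buffon_intersection_frequency(l, d, **kwargs)
    return 2.0 * l / (frequency * d)

print(buffon_intersection_frequency(1.0, 2.0), 2.0 * 1.0 / (math.pi * 2.0))
print(estimate_pi(1.0, 2.0))
```

With $l = 1$ and $d = 2$ the exact probability is $1/\pi \approx 0.318$; the accuracy of the estimate improves only slowly with the number of throws, as is typical of such simulations.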


Joint CDFs 


If $X$ and $Y$ are two random variables associated with the same experiment, we
define their joint CDF by

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y).$$

As in the case of a single random variable, the advantage of working with the
CDF is that it applies equally well to discrete and continuous random variables.
In particular, if $X$ and $Y$ are described by a joint PDF $f_{X,Y}$, then

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = \int_{-\infty}^{x} \int_{-\infty}^{y} f_{X,Y}(s,t)\,dt\,ds.$$

Conversely, the PDF can be recovered from the CDF by differentiating:

$$f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}}{\partial x\,\partial y}(x,y).$$

Example 3.12. Let $X$ and $Y$ be described by a uniform PDF on the unit square.
The joint CDF is given by

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = xy, \qquad \text{for } 0 \le x, y \le 1.$$

We then verify that

$$\frac{\partial^2 F_{X,Y}}{\partial x\,\partial y}(x,y) = \frac{\partial^2 (xy)}{\partial x\,\partial y}(x,y) = 1 = f_{X,Y}(x,y),$$








for all $(x,y)$ in the unit square.

Expectation

If $X$ and $Y$ are jointly continuous random variables and $g$ is some function, then
$Z = g(X,Y)$ is also a random variable. We will see in Section 4.1 methods for
computing the PDF of $Z$, if it has one. For now, let us note that the expected
value rule is still applicable and

$$E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{X,Y}(x,y)\,dx\,dy.$$

As an important special case, for any scalars $a$, $b$, and $c$, we have

$$E[aX + bY + c] = aE[X] + bE[Y] + c.$$

More than Two Random Variables

The joint PDF of three random variables $X$, $Y$, and $Z$ is defined in analogy with
the case of two random variables. For example, we have

$$P\big((X,Y,Z) \in B\big) = \iiint_{(x,y,z) \in B} f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz,$$

for any set $B$. We also have relations such as

$$f_{X,Y}(x,y) = \int_{-\infty}^{\infty} f_{X,Y,Z}(x,y,z)\,dz$$

and

$$f_X(x) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y,Z}(x,y,z)\,dy\,dz.$$

The expected value rule takes the form

$$E[g(X,Y,Z)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y,z) f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz,$$

and if $g$ is linear, of the form $aX + bY + cZ$, then

$$E[aX + bY + cZ] = aE[X] + bE[Y] + cE[Z].$$

Furthermore, there are obvious generalizations of the above to the case of more
than three random variables. For example, for any random variables $X_1, X_2, \ldots, X_n$
and any scalars $a_1, a_2, \ldots, a_n$, we have


$$E[a_1 X_1 + a_2 X_2 + \cdots + a_n X_n] = a_1 E[X_1] + a_2 E[X_2] + \cdots + a_n E[X_n].$$

Summary of Facts about Joint PDFs

Let $X$ and $Y$ be jointly continuous random variables with joint PDF $f_{X,Y}$.

• The joint PDF is used to calculate probabilities:

$$P\big((X,Y) \in B\big) = \iint_{(x,y) \in B} f_{X,Y}(x,y)\,dx\,dy.$$

• The marginal PDFs of $X$ and $Y$ can be obtained from the joint PDF,
using the formulas

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy, \qquad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx.$$

• The joint CDF is defined by $F_{X,Y}(x,y) = P(X \le x,\ Y \le y)$, and
determines the joint PDF through the formula

$$f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}}{\partial x\,\partial y}(x,y),$$

for every $(x,y)$ at which the joint PDF is continuous.

• A function $g(X,Y)$ of $X$ and $Y$ defines a new random variable, and

$$E[g(X,Y)] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} g(x,y) f_{X,Y}(x,y)\,dx\,dy.$$

If $g$ is linear, of the form $aX + bY + c$, we have

$$E[aX + bY + c] = aE[X] + bE[Y] + c.$$

• The above have natural extensions to the case where more than two
random variables are involved.





3.5 CONDITIONING 


Similar to the case of discrete random variables, we can condition a random
variable on an event or on another random variable, and define the concepts
of conditional PDF and conditional expectation. The various definitions and
formulas parallel the ones for the discrete case, and their interpretation is similar,
except for some subtleties that arise when we condition on an event of the form
$\{Y = y\}$, which has zero probability.

Conditioning a Random Variable on an Event

The conditional PDF of a continuous random variable $X$, given an event $A$
with $P(A) > 0$, is defined as a nonnegative function $f_{X|A}$ that satisfies

$$P(X \in B \mid A) = \int_B f_{X|A}(x)\,dx,$$

for any subset $B$ of the real line. In particular, by letting $B$ be the entire real
line, we obtain the normalization property

$$\int_{-\infty}^{\infty} f_{X|A}(x)\,dx = 1,$$

so that $f_{X|A}$ is a legitimate PDF.

In the important special case where we condition on an event of the form
$\{X \in A\}$, with $P(X \in A) > 0$, the definition of conditional probabilities yields

$$P(X \in B \mid X \in A) = \frac{P(X \in B,\ X \in A)}{P(X \in A)} = \frac{\displaystyle\int_{A \cap B} f_X(x)\,dx}{P(X \in A)}.$$

By comparing with the earlier formula, we conclude that

$$f_{X|\{X \in A\}}(x) = \begin{cases} \dfrac{f_X(x)}{P(X \in A)}, & \text{if } x \in A, \\ 0, & \text{otherwise.} \end{cases}$$

As in the discrete case, the conditional PDF is zero outside the conditioning set.
Within the conditioning set, the conditional PDF has exactly the same shape as
the unconditional one, except that it is scaled by the constant factor $1/P(X \in A)$,
so that $f_{X|\{X \in A\}}$ integrates to 1; see Fig. 3.14. Thus, the conditional PDF is
similar to an ordinary PDF, except that it refers to a new universe in which the
event $\{X \in A\}$ is known to have occurred.


Figure 3.14: The unconditional PDF $f_X$ and the conditional PDF $f_{X|\{X \in A\}}$,
where $A$ is the interval $[a,b]$. Note that within the conditioning event $A$, $f_{X|\{X \in A\}}$
retains the same shape as $f_X$, except that it is scaled along the vertical axis.

Example 3.13. The Exponential Random Variable is Memoryless. The
time $T$ until a new light bulb burns out is an exponential random variable with
parameter $\lambda$. Ariadne turns the light on, leaves the room, and when she returns, $t$
time units later, finds that the light bulb is still on, which corresponds to the event
$A = \{T > t\}$. Let $X$ be the additional time until the light bulb burns out. What
is the conditional CDF of $X$, given the event $A$?

We have, for $x \ge 0$,

$$\begin{aligned}
P(X > x \mid A) &= P(T > t + x \mid T > t) = \frac{P(T > t + x \text{ and } T > t)}{P(T > t)} \\
&= \frac{P(T > t + x)}{P(T > t)} = \frac{e^{-\lambda(t+x)}}{e^{-\lambda t}} = e^{-\lambda x},
\end{aligned}$$

where we have used the expression for the CDF of an exponential random variable
derived in Section 3.2.

Thus, the conditional CDF of $X$ is exponential with parameter $\lambda$, regardless
of the time $t$ that elapsed between the lighting of the bulb and Ariadne's arrival.
This is known as the memorylessness property of the exponential. Generally, if we
model the time to complete a certain operation by an exponential random variable
$X$, this property implies that as long as the operation has not been completed, the
remaining time up to completion has the same exponential CDF, no matter when
the operation started.
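The memorylessness property can be checked numerically. The sketch below uses our own function names and arbitrarily chosen values of $\lambda$, $t$, and $x$. It evaluates $P(T > t + x \mid T > t)$ from the definition of conditional probability and also estimates it from simulated lifetimes drawn with `random.expovariate`.

```python
import math
import random

def exp_survival(t, lam):
    """P(T > t) for an exponential random variable with parameter lam, t >= 0."""
    return math.exp(-lam * t)

def conditional_survival(x, t, lam):
    """P(T > t + x | T > t), from the definition of conditional probability."""
    return exp_survival(t + x, lam) / exp_survival(t, lam)

def simulated_conditional_survival(x, t, lam, n=200_000, seed=0):
    """Among simulated bulbs still on at time t, the fraction still on at t + x."""
    rng = random.Random(seed)
    on_at_t = 0
    on_at_t_plus_x = 0
    for _ in range(n):
        lifetime = rng.expovariate(lam)
        if lifetime > t:
            on_at_t += 1
            if lifetime > t + x:
                on_at_t_plus_x += 1
    return on_at_t_plus_x / on_at_t

# The conditional survival probability does not depend on t.
print(conditional_survival(0.5, 1.0, 1.0), conditional_survival(0.5, 3.0, 1.0))
print(simulated_conditional_survival(0.5, 1.0, 1.0), math.exp(-0.5))
```

Both exact values equal $e^{-\lambda x}$, and the simulated fraction should be close to the same number.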


When multiple random variables are involved, there is a similar notion
of a joint conditional PDF. Suppose, for example, that $X$ and $Y$ are jointly
continuous random variables, with joint PDF $f_{X,Y}$. If we condition on a positive
probability event of the form $C = \{(X,Y) \in A\}$, we have

$$f_{X,Y|C}(x,y) = \begin{cases} \dfrac{f_{X,Y}(x,y)}{P(C)}, & \text{if } (x,y) \in A, \\ 0, & \text{otherwise.} \end{cases}$$

In this case, the conditional PDF of $X$, given this event, can be obtained from
the formula

$$f_{X|C}(x) = \int_{-\infty}^{\infty} f_{X,Y|C}(x,y)\,dy.$$

These two formulas provide one possible method for obtaining the conditional
PDF of a random variable $X$ when the conditioning event is not of the form
$\{X \in A\}$, but is instead defined in terms of multiple random variables.

We finally note that there is a version of the total probability theorem,
which involves conditional PDFs: if the events $A_1, \ldots, A_n$ form a partition of
the sample space, then

$$f_X(x) = \sum_{i=1}^{n} P(A_i) f_{X|A_i}(x).$$

To justify this statement, we use the total probability theorem from Chapter 1,
and obtain

$$P(X \le x) = \sum_{i=1}^{n} P(A_i) P(X \le x \mid A_i).$$

This formula can be rewritten as

$$\int_{-\infty}^{x} f_X(t)\,dt = \sum_{i=1}^{n} P(A_i) \int_{-\infty}^{x} f_{X|A_i}(t)\,dt.$$

We then take the derivative of both sides, with respect to $x$, and obtain the
desired result.


Conditional PDF Given an Event 


• The conditional PDF $f_{X|A}$ of a continuous random variable $X$, given
an event $A$ with $P(A) > 0$, satisfies

$$P(X \in B \mid A) = \int_B f_{X|A}(x)\,dx.$$

• If $A$ is a subset of the real line with $P(X \in A) > 0$, then

$$f_{X|\{X \in A\}}(x) = \begin{cases} \dfrac{f_X(x)}{P(X \in A)}, & \text{if } x \in A, \\ 0, & \text{otherwise.} \end{cases}$$

• Let $A_1, A_2, \ldots, A_n$ be disjoint events that form a partition of the
sample space, and assume that $P(A_i) > 0$ for all $i$. Then,

$$f_X(x) = \sum_{i=1}^{n} P(A_i) f_{X|A_i}(x)$$

(a version of the total probability theorem).





The following example illustrates a divide-and-conquer approach that uses 
the total probability theorem to calculate a PDF.

Example 3.14. The metro train arrives at the station near your home every
quarter hour starting at 6:00 a.m. You walk into the station every morning between
7:10 and 7:30 a.m., and your arrival time is a uniform random variable over this
interval. What is the PDF of the time you have to wait for the first train to arrive?

Figure 3.15: The PDFs $f_X$, $f_{Y|A}$, $f_{Y|B}$, and $f_Y$ in Example 3.14.

The time of your arrival, denoted by $X$, is a uniform random variable over
the interval from 7:10 to 7:30; see Fig. 3.15(a). Let $Y$ be the waiting time. We
calculate the PDF $f_Y$ using a divide-and-conquer strategy. Let $A$ and $B$ be the
events

$$A = \{7{:}10 \le X \le 7{:}15\} = \{\text{you board the 7:15 train}\},$$

$$B = \{7{:}15 < X \le 7{:}30\} = \{\text{you board the 7:30 train}\}.$$

Conditioned on the event $A$, your arrival time is uniform over the interval from 7:10
to 7:15. In this case, the waiting time $Y$ is also uniform and takes values between
0 and 5 minutes; see Fig. 3.15(b). Similarly, conditioned on $B$, $Y$ is uniform and
takes values between 0 and 15 minutes; see Fig. 3.15(c). The PDF of $Y$ is obtained
using the total probability theorem,

$$f_Y(y) = P(A) f_{Y|A}(y) + P(B) f_{Y|B}(y),$$

and is shown in Fig. 3.15(d). We have

$$f_Y(y) = \frac{1}{4} \cdot \frac{1}{5} + \frac{3}{4} \cdot \frac{1}{15} = \frac{1}{10}, \qquad \text{for } 0 \le y \le 5,$$

and

$$f_Y(y) = \frac{1}{4} \cdot 0 + \frac{3}{4} \cdot \frac{1}{15} = \frac{1}{20}, \qquad \text{for } 5 < y \le 15.$$
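The divide-and-conquer calculation of Example 3.14 can be written out in a few lines. This is an illustrative sketch with hypothetical function names. It uses exact rational arithmetic from the standard `fractions` module and measures the waiting time in minutes.

```python
from fractions import Fraction

def uniform_pdf(y, a, b):
    """PDF of a uniform random variable on [a, b], evaluated at y."""
    return Fraction(1, b - a) if a <= y <= b else Fraction(0)

def waiting_time_pdf(y):
    """f_Y(y) = P(A) f_{Y|A}(y) + P(B) f_{Y|B}(y), with y in minutes."""
    p_a = Fraction(5, 20)   # arrival between 7:10 and 7:15
    p_b = Fraction(15, 20)  # arrival between 7:15 and 7:30
    return p_a * uniform_pdf(y, 0, 5) + p_b * uniform_pdf(y, 0, 15)

print(waiting_time_pdf(2), waiting_time_pdf(10))  # 1/10 1/20
```

The two printed values match the two pieces of $f_Y$, and the areas $5 \cdot \tfrac{1}{10} + 10 \cdot \tfrac{1}{20}$ add up to 1, as required of a PDF.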
Conditioning one Random Variable on Another 


Let $X$ and $Y$ be continuous random variables with joint PDF $f_{X,Y}$. For any $y$
with $f_Y(y) > 0$, the conditional PDF of $X$ given that $Y = y$, is defined by

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}.$$

This definition is analogous to the formula $p_{X|Y}(x \mid y) = p_{X,Y}(x,y)/p_Y(y)$ for
the discrete case.

When thinking about the conditional PDF, it is best to view $y$ as a fixed
number and consider $f_{X|Y}(x \mid y)$ as a function of the single variable $x$. Viewed as
a function of $x$, $f_{X|Y}(x \mid y)$ has the same shape as the joint PDF $f_{X,Y}(x,y)$,
because the denominator $f_Y(y)$ does not depend on $x$; see Fig. 3.16. Furthermore,
the formula

$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx$$

implies the normalization property

$$\int_{-\infty}^{\infty} f_{X|Y}(x \mid y)\,dx = 1,$$

so for any fixed $y$, $f_{X|Y}(x \mid y)$ is a legitimate PDF.

Figure 3.16: Visualization of the conditional PDF $f_{X|Y}(x \mid y)$. Let $X$ and $Y$
have a joint PDF which is uniform on the set $S$. For each fixed $y$, we consider the
joint PDF along the slice $Y = y$ and normalize it so that it integrates to 1.


Example 3.15. Circular Uniform PDF. Ben throws a dart at a circular target
of radius $r$ (see Fig. 3.17). We assume that he always hits the target, and that
all points of impact $(x,y)$ are equally likely, so that the joint PDF of the random
variables $X$ and $Y$ is uniform. Following Example 3.9, and since the area of the
circle is $\pi r^2$, we have

$$f_{X,Y}(x,y) = \begin{cases} \dfrac{1}{\text{area of the circle}}, & \text{if } (x,y) \text{ is in the circle}, \\ 0, & \text{otherwise,} \end{cases} = \begin{cases} \dfrac{1}{\pi r^2}, & \text{if } x^2 + y^2 \le r^2, \\ 0, & \text{otherwise.} \end{cases}$$

Figure 3.17: Circular target for Example 3.15.

To calculate the conditional PDF $f_{X|Y}(x \mid y)$, let us first find the marginal
PDF $f_Y(y)$. For $|y| > r$, it is zero. For $|y| \le r$, it is given by

$$\begin{aligned}
f_Y(y) &= \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dx \\
&= \frac{1}{\pi r^2} \int_{x^2 + y^2 \le r^2} dx \\
&= \frac{1}{\pi r^2} \int_{-\sqrt{r^2 - y^2}}^{\sqrt{r^2 - y^2}} dx \\
&= \frac{2}{\pi r^2} \sqrt{r^2 - y^2}, \qquad \text{if } |y| \le r.
\end{aligned}$$

Note that the marginal PDF $f_Y$ is not uniform.
The conditional PDF is

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{\dfrac{1}{\pi r^2}}{\dfrac{2}{\pi r^2} \sqrt{r^2 - y^2}} = \frac{1}{2\sqrt{r^2 - y^2}}, \qquad \text{if } x^2 + y^2 \le r^2.$$

Thus, for a fixed value of $y$, the conditional PDF $f_{X|Y}$ is uniform.
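The formulas of Example 3.15 can be verified numerically. The sketch below is our own illustration: the function names are hypothetical, and `midpoint_integral` is a simple midpoint-rule helper introduced only for this check. It confirms that the marginal and the conditional PDFs each integrate to (approximately) 1.

```python
import math

def joint_pdf(x, y, r):
    """Uniform joint PDF on the disk of radius r centered at the origin."""
    return 1.0 / (math.pi * r * r) if x * x + y * y <= r * r else 0.0

def marginal_pdf_y(y, r):
    """f_Y(y) = 2 sqrt(r^2 - y^2) / (pi r^2) for |y| <= r, and 0 otherwise."""
    if abs(y) > r:
        return 0.0
    return 2.0 * math.sqrt(r * r - y * y) / (math.pi * r * r)

def conditional_pdf_x_given_y(x, y, r):
    """f_{X|Y}(x | y) = f_{X,Y}(x, y) / f_Y(y), defined only where f_Y(y) > 0."""
    fy = marginal_pdf_y(y, r)
    if fy == 0.0:
        raise ValueError("conditional PDF is defined only where f_Y(y) > 0")
    return joint_pdf(x, y, r) / fy

def midpoint_integral(f, a, b, n=20_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (k + 0.5) * h) for k in range(n))

r = 2.0
print(conditional_pdf_x_given_y(0.0, 1.0, r), 1.0 / (2.0 * math.sqrt(r * r - 1.0)))
print(midpoint_integral(lambda y: marginal_pdf_y(y, r), -r, r))
print(midpoint_integral(lambda x: conditional_pdf_x_given_y(x, 1.0, r), -r, r))
```

For a fixed $y$, the conditional PDF takes the constant value $1/(2\sqrt{r^2 - y^2})$ on the chord of the circle at height $y$, which is the uniform PDF on that chord.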
To interpret the conditional PDF, let us fix some small positive numbers 
$\delta_1$ and $\delta_2$, and condition on the event $B = \{y \le Y \le y + \delta_2\}$. We have

$$\begin{aligned}
P(x \le X \le x + \delta_1 \mid y \le Y \le y + \delta_2) &= \frac{P(x \le X \le x + \delta_1 \text{ and } y \le Y \le y + \delta_2)}{P(y \le Y \le y + \delta_2)} \\
&\approx \frac{f_{X,Y}(x,y)\,\delta_1 \delta_2}{f_Y(y)\,\delta_2} \\
&= f_{X|Y}(x \mid y)\,\delta_1.
\end{aligned}$$

In words, $f_{X|Y}(x \mid y)\,\delta_1$ provides us with the probability that $X$ belongs to a
small interval $[x, x + \delta_1]$, given that $Y$ belongs to a small interval $[y, y + \delta_2]$.
Since $f_{X|Y}(x \mid y)\,\delta_1$ does not depend on $\delta_2$, we can think of the limiting case
where $\delta_2$ decreases to zero and write

$$P(x \le X \le x + \delta_1 \mid Y = y) \approx f_{X|Y}(x \mid y)\,\delta_1, \qquad (\delta_1 \text{ small}),$$

and, more generally,

$$P(X \in A \mid Y = y) = \int_A f_{X|Y}(x \mid y)\,dx.$$

Conditional probabilities, given the zero probability event $\{Y = y\}$, were left
undefined in Chapter 1. But the above formula provides a natural way of defining
such conditional probabilities in the present context. In addition, it allows us to
view the conditional PDF $f_{X|Y}(x \mid y)$ (as a function of $x$) as a description of the
probability law of $X$, given that the event $\{Y = y\}$ has occurred.

As in the discrete case, the conditional PDF $f_{X|Y}$, together with the
marginal PDF $f_Y$, are sometimes used to calculate the joint PDF. Furthermore,
this approach can also be used for modeling: instead of directly specifying $f_{X,Y}$,
it is often natural to provide a probability law for $Y$, in terms of a PDF $f_Y$,
and then provide a conditional PDF $f_{X|Y}(x \mid y)$ for $X$, given any possible value
$y$ of $Y$.


Example 3.16. The speed of a typical vehicle that drives past a police radar is 
modeled as an exponentially distributed random variable X with mean 50 miles per 
hour. The police radar's measurement Y of the vehicle's speed has an error which 
is modeled as a normal random variable with zero mean and standard deviation 
equal to one tenth of the vehicle's speed. What is the joint PDF of X and Y? 

We have $f_X(x) = (1/50)e^{-x/50}$, for $x \ge 0$. Also, conditioned on $X = x$, the
measurement $Y$ has a normal PDF with mean $x$ and variance $x^2/100$. Therefore,

$$f_{Y|X}(y \mid x) = \frac{1}{\sqrt{2\pi}\,(x/10)}\, e^{-(y-x)^2/(2x^2/100)}.$$

Thus, for all $x > 0$ and all $y$,

$$f_{X,Y}(x,y) = f_X(x) f_{Y|X}(y \mid x) = \frac{1}{50}\, e^{-x/50}\, \frac{10}{\sqrt{2\pi}\,x}\, e^{-50(y-x)^2/x^2}.$$
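The construction of the joint PDF in Example 3.16 as a product of a marginal and a conditional PDF can be sketched in code. The function names below are hypothetical, and the helper `marginal_from_joint` is our own check: it integrates the joint PDF over $y$ with a midpoint rule (over a range of 8 standard deviations on each side of $x$, an arbitrary choice) to confirm that the marginal $f_X$ is recovered.

```python
import math

def speed_pdf(x):
    """f_X(x): exponential PDF with mean 50."""
    return math.exp(-x / 50.0) / 50.0 if x >= 0 else 0.0

def measurement_pdf_given_speed(y, x):
    """f_{Y|X}(y | x): normal PDF with mean x and standard deviation x / 10 (x > 0)."""
    sigma = x / 10.0
    return math.exp(-((y - x) ** 2) / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)

def joint_pdf(x, y):
    """f_{X,Y}(x, y) = f_X(x) f_{Y|X}(y | x), for x > 0 and all y."""
    if x <= 0:
        return 0.0
    return speed_pdf(x) * measurement_pdf_given_speed(y, x)

def marginal_from_joint(x, n=4000):
    """Integrate the joint PDF over y (midpoint rule) to recover f_X(x), x > 0."""
    sigma = x / 10.0
    a, b = x - 8.0 * sigma, x + 8.0 * sigma
    h = (b - a) / n
    return h * sum(joint_pdf(x, a + (k + 0.5) * h) for k in range(n))

print(marginal_from_joint(60.0), speed_pdf(60.0))
```

The two printed numbers should agree closely, which reflects the normalization property of the conditional PDF $f_{Y|X}$.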
Conditional PDF Given a Random Variable 
Let $X$ and $Y$ be jointly continuous random variables with joint PDF $f_{X,Y}$.

• The joint, marginal, and conditional PDFs are related to each other
by the formulas

$$f_{X,Y}(x,y) = f_Y(y) f_{X|Y}(x \mid y),$$

$$f_X(x) = \int_{-\infty}^{\infty} f_Y(y) f_{X|Y}(x \mid y)\,dy.$$

The conditional PDF $f_{X|Y}(x \mid y)$ is defined only for those $y$ for which
$f_Y(y) > 0$.

• We have

$$P(X \in A \mid Y = y) = \int_A f_{X|Y}(x \mid y)\,dx.$$

For the case of more than two random variables, there are natural
extensions to the above. For example, we can define conditional PDFs by formulas
such as

$$f_{X,Y|Z}(x,y \mid z) = \frac{f_{X,Y,Z}(x,y,z)}{f_Z(z)}, \qquad \text{if } f_Z(z) > 0,$$

$$f_{X|Y,Z}(x \mid y,z) = \frac{f_{X,Y,Z}(x,y,z)}{f_{Y,Z}(y,z)}, \qquad \text{if } f_{Y,Z}(y,z) > 0.$$

There is also an analog of the multiplication rule,

$$f_{X,Y,Z}(x,y,z) = f_{X|Y,Z}(x \mid y,z) f_{Y|Z}(y \mid z) f_Z(z),$$

and of other formulas developed in this section.

Conditional Expectation

For a continuous random variable $X$, we define its conditional expectation
$E[X \mid A]$ given an event $A$, similar to the unconditional case, except that we now
need to use the conditional PDF $f_{X|A}$. The conditional expectation $E[X \mid Y = y]$
is defined similarly, in terms of the conditional PDF $f_{X|Y}$. Various familiar
properties of expectations carry over to the present context and are summarized
below. We note that all formulas are analogous to corresponding formulas for
the case of discrete random variables, except that sums are replaced by integrals,
and PMFs are replaced by PDFs.

Summary of Facts About Conditional Expectations

Let $X$ and $Y$ be jointly continuous random variables, and let $A$ be an event
with $P(A) > 0$.

• Definitions: The conditional expectation of $X$ given the event $A$ is
defined by

$$E[X \mid A] = \int_{-\infty}^{\infty} x f_{X|A}(x)\,dx.$$

The conditional expectation of $X$ given that $Y = y$ is defined by

$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx.$$

• The expected value rule: For a function $g(X)$, we have

$$E[g(X) \mid A] = \int_{-\infty}^{\infty} g(x) f_{X|A}(x)\,dx,$$

and

$$E[g(X) \mid Y = y] = \int_{-\infty}^{\infty} g(x) f_{X|Y}(x \mid y)\,dx.$$

• Total expectation theorem: Let $A_1, A_2, \ldots, A_n$ be disjoint events
that form a partition of the sample space, and assume that $P(A_i) > 0$
for all $i$. Then,

$$E[X] = \sum_{i=1}^{n} P(A_i) E[X \mid A_i].$$

Similarly,

$$E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\,dy.$$

• There are natural analogs for the case of functions of several random
variables. For example,

$$E[g(X,Y) \mid Y = y] = \int_{-\infty}^{\infty} g(x,y) f_{X|Y}(x \mid y)\,dx,$$

and

$$E[g(X,Y)] = \int_{-\infty}^{\infty} E[g(X,Y) \mid Y = y] f_Y(y)\,dy.$$








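The expected value rule is established in the same manner as for the case of
unconditional expectations. To justify the first version of the total expectation
theorem, we start with the total probability theorem

$$f_X(x) = \sum_{i=1}^{n} P(A_i) f_{X|A_i}(x),$$

multiply both sides by $x$, and then integrate from $-\infty$ to $\infty$.

To justify the second version of the total expectation theorem, we observe
that

$$\begin{aligned}
\int_{-\infty}^{\infty} E[X \mid Y = y] f_Y(y)\,dy &= \int_{-\infty}^{\infty} \left[ \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y)\,dx \right] f_Y(y)\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f_{X|Y}(x \mid y) f_Y(y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x f_{X,Y}(x,y)\,dx\,dy \\
&= \int_{-\infty}^{\infty} x \left[ \int_{-\infty}^{\infty} f_{X,Y}(x,y)\,dy \right] dx \\
&= \int_{-\infty}^{\infty} x f_X(x)\,dx \\
&= E[X].
\end{aligned}$$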


The total expectation theorem can often facilitate the calculation of the 
mean, variance, and other moments of a random variable, using a divide-and- 
conquer approach. 


Example 3.17. Mean and Variance of a Piecewise Constant PDF. Suppose
that the random variable $X$ has the piecewise constant PDF

$$f_X(x) = \begin{cases} 1/3, & \text{if } 0 \le x \le 1, \\ 2/3, & \text{if } 1 < x \le 2, \\ 0, & \text{otherwise,} \end{cases}$$

(see Fig. 3.18). Consider the events

$$A_1 = \{X \text{ lies in the first interval } [0,1]\},$$

$$A_2 = \{X \text{ lies in the second interval } (1,2]\}.$$

Figure 3.18: Piecewise constant PDF for Example 3.17.

We have from the given PDF,

$$P(A_1) = \int_0^1 f_X(x)\,dx = \frac{1}{3}, \qquad P(A_2) = \int_1^2 f_X(x)\,dx = \frac{2}{3}.$$

Furthermore, the conditional mean and second moment of $X$, conditioned on $A_1$
and $A_2$, are easily calculated since the corresponding conditional PDFs $f_{X|A_1}$ and
$f_{X|A_2}$ are uniform. We recall from Example 3.4 that the mean of a uniform random
variable over an interval $[a,b]$ is $(a+b)/2$ and its second moment is $(a^2 + ab + b^2)/3$.
Thus,

$$E[X \mid A_1] = \frac{1}{2}, \qquad E[X \mid A_2] = \frac{3}{2},$$

$$E[X^2 \mid A_1] = \frac{1}{3}, \qquad E[X^2 \mid A_2] = \frac{7}{3}.$$

We now use the total expectation theorem to obtain

$$E[X] = P(A_1) E[X \mid A_1] + P(A_2) E[X \mid A_2] = \frac{1}{3} \cdot \frac{1}{2} + \frac{2}{3} \cdot \frac{3}{2} = \frac{7}{6},$$

$$E[X^2] = P(A_1) E[X^2 \mid A_1] + P(A_2) E[X^2 \mid A_2] = \frac{1}{3} \cdot \frac{1}{3} + \frac{2}{3} \cdot \frac{7}{3} = \frac{15}{9}.$$

The variance is given by

$$\mathrm{var}(X) = E[X^2] - \big(E[X]\big)^2 = \frac{15}{9} - \frac{49}{36} = \frac{11}{36}.$$

Note that this approach to the mean and variance calculation is easily generalized
to piecewise constant PDFs with more than two pieces.
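The generalization to more than two pieces can be sketched as follows. This is an illustration, not code from the text: the function name `piecewise_constant_moments` and its input format (a list of `(a, b, height)` triples) are our own choices, and exact rational arithmetic is used so that the results can be compared with the fractions above.

```python
from fractions import Fraction

def piecewise_constant_moments(pieces):
    """Mean and variance of a piecewise constant PDF via total expectation.

    pieces: list of (a, b, height), the PDF being equal to height on the
    interval from a to b (hypothetical input format).
    """
    mean = Fraction(0)
    second_moment = Fraction(0)
    total_probability = Fraction(0)
    for a, b, height in pieces:
        a, b, height = Fraction(a), Fraction(b), Fraction(height)
        p = height * (b - a)                        # P(A_i)
        cond_mean = (a + b) / 2                     # E[X | A_i], uniform on the piece
        cond_second = (a * a + a * b + b * b) / 3   # E[X^2 | A_i]
        mean += p * cond_mean
        second_moment += p * cond_second
        total_probability += p
    if total_probability != 1:
        raise ValueError("the piece probabilities must sum to 1")
    return mean, second_moment - mean * mean

mean, variance = piecewise_constant_moments(
    [(0, 1, Fraction(1, 3)), (1, 2, Fraction(2, 3))]
)
print(mean, variance)  # 7/6 11/36
```

Each piece contributes its probability times the conditional mean (or second moment) of a uniform random variable on that piece, exactly as in the two-piece calculation of Example 3.17.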


Independence 


In full analogy with the discrete case, we say that two continuous random
variables $X$ and $Y$ are independent if their joint PDF is the product of the marginal
PDFs:

$$f_{X,Y}(x,y) = f_X(x) f_Y(y), \qquad \text{for all } x, y.$$

Comparing with the formula $f_{X,Y}(x,y) = f_{X|Y}(x \mid y) f_Y(y)$, we see that
independence is the same as the condition

$$f_{X|Y}(x \mid y) = f_X(x), \qquad \text{for all } y \text{ with } f_Y(y) > 0 \text{ and all } x,$$

or, symmetrically,

$$f_{Y|X}(y \mid x) = f_Y(y), \qquad \text{for all } x \text{ with } f_X(x) > 0 \text{ and all } y.$$

There is a natural generalization to the case of more than two random
variables. For example, we say that the three random variables $X$, $Y$, and $Z$ are
independent if

$$f_{X,Y,Z}(x,y,z) = f_X(x) f_Y(y) f_Z(z), \qquad \text{for all } x, y, z.$$


Example 3.18. Independent Normal Random Variables. Let $X$ and $Y$
be independent normal random variables with means $\mu_x$, $\mu_y$, and variances $\sigma_x^2$, $\sigma_y^2$,
respectively. Their joint PDF is of the form

$$f_{X,Y}(x,y) = f_X(x) f_Y(y) = \frac{1}{2\pi \sigma_x \sigma_y} \exp\left\{ -\frac{(x - \mu_x)^2}{2\sigma_x^2} - \frac{(y - \mu_y)^2}{2\sigma_y^2} \right\}.$$

This joint PDF has the shape of a bell centered at $(\mu_x, \mu_y)$, and whose width in
the $x$ and $y$ directions is proportional to $\sigma_x$ and $\sigma_y$, respectively. We can get some
additional insight into the form of this PDF by considering its contours, i.e., sets
of points at which the PDF takes a constant value. These contours are described
by an equation of the form

$$\frac{(x - \mu_x)^2}{\sigma_x^2} + \frac{(y - \mu_y)^2}{\sigma_y^2} = \text{constant},$$

and are ellipses whose two axes are horizontal and vertical (see Fig. 3.19). In the
special case where $\sigma_x = \sigma_y$, the contours are circles.











E 


vide 
(es Hy * 


Figure 3.19: Contours of the joint PDF of two independent normal random
variables $X$ and $Y$ with means $\mu_x$, $\mu_y$, and variances $\sigma_x^2$, $\sigma_y^2$, respectively.

If $X$ and $Y$ are independent, then any two events of the form $\{X \in A\}$ and
$\{Y \in B\}$ are independent. Indeed,

$$\begin{aligned}
P(X \in A \text{ and } Y \in B) &= \int_{x \in A} \int_{y \in B} f_{X,Y}(x,y)\,dy\,dx \\
&= \int_{x \in A} \int_{y \in B} f_X(x) f_Y(y)\,dy\,dx \\
&= \int_{x \in A} f_X(x)\,dx \int_{y \in B} f_Y(y)\,dy \\
&= P(X \in A) P(Y \in B).
\end{aligned}$$

In particular, independence implies that

$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = P(X \le x) P(Y \le y) = F_X(x) F_Y(y).$$

The converse of these statements is also true; see the end-of-chapter problems.
The property

$$F_{X,Y}(x,y) = F_X(x) F_Y(y), \qquad \text{for all } x, y,$$

can be used to provide a general definition of independence between two random
variables, e.g., if $X$ is discrete and $Y$ is continuous.

An argument similar to the discrete case shows that if $X$ and $Y$ are
independent, then

$$E[g(X)h(Y)] = E[g(X)] E[h(Y)],$$

for any two functions $g$ and $h$. Finally, the variance of the sum of independent
random variables is equal to the sum of their variances.


Independence of Continuous Random Variables 
Let $X$ and $Y$ be jointly continuous random variables.

• $X$ and $Y$ are independent if

$$f_{X,Y}(x,y) = f_X(x) f_Y(y), \qquad \text{for all } x, y.$$

• If $X$ and $Y$ are independent, then

$$E[XY] = E[X] E[Y].$$

Furthermore, for any functions $g$ and $h$, the random variables $g(X)$
and $h(Y)$ are independent, and we have

$$E[g(X)h(Y)] = E[g(X)] E[h(Y)].$$

• If $X$ and $Y$ are independent, then

$$\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y).$$
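The consequences of independence summarized above can be illustrated by simulation. In the sketch below, the distributions ($X$ uniform on $[0,1]$, $Y$ exponential with parameter 1) and the functions $g(x) = x^2$, $h(y) = e^{-y}$ are arbitrary choices of ours; for them, $E[g(X)]E[h(Y)] = \tfrac{1}{3} \cdot \tfrac{1}{2} = \tfrac{1}{6}$ and $\mathrm{var}(X) + \mathrm{var}(Y) = \tfrac{1}{12} + 1$.

```python
import math
import random

def sample_mean(values):
    return sum(values) / len(values)

def sample_variance(values):
    m = sample_mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)

rng = random.Random(0)
n = 200_000
xs = [rng.uniform(0.0, 1.0) for _ in range(n)]  # X uniform on [0, 1]
ys = [rng.expovariate(1.0) for _ in range(n)]   # Y exponential(1), drawn independently of X

# E[g(X) h(Y)] versus E[g(X)] E[h(Y)], with g(x) = x^2 and h(y) = exp(-y).
lhs_product = sample_mean([x * x * math.exp(-y) for x, y in zip(xs, ys)])
rhs_product = sample_mean([x * x for x in xs]) * sample_mean([math.exp(-y) for y in ys])

# var(X + Y) versus var(X) + var(Y).
var_of_sum = sample_variance([x + y for x, y in zip(xs, ys)])
sum_of_vars = sample_variance(xs) + sample_variance(ys)

print(lhs_product, rhs_product)
print(var_of_sum, sum_of_vars)
```

Each pair of printed numbers should nearly coincide; the small discrepancies are sampling error, not a failure of the identities.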





3.6 THE CONTINUOUS BAYES' RULE 


In many situations, we represent an unobserved phenomenon by a random vari- 
able X with PDF fx and we make a noisy measurement Y, which is modeled 
in terms of a conditional PDF fy,y. Once the value of Y is measured. what 
information does it provide on the unknown value of X? This setting is similar 
to the one of Section 1.4, where we introduced Bayes' rule and used it to solve 
inference problems; see Fig. 3.20. The only difference is that we are now dealing 
with continuous random variables. 





Figure 3.20: Schematic description of the inference problem. We have an 
unobserved random variable X with known PDF, and we obtain a measurement 
Y according to a conditional PDF fy,x. Given an observed value y of Y, the 
inference problem is to evaluate the conditional PDF f xiy (z | y). 


Note that whatever information is provided by the event {Y = y} is captured
by the conditional PDF $f_{X|Y}(x\,|\,y)$. It thus suffices to evaluate this PDF.
From the formulas $f_X f_{Y|X} = f_{X,Y} = f_Y f_{X|Y}$, it follows that
$$f_{X|Y}(x\,|\,y) = \frac{f_X(x)\,f_{Y|X}(y\,|\,x)}{f_Y(y)}.$$
Based on the normalization property $\int_{-\infty}^{\infty} f_{X|Y}(x\,|\,y)\,dx = 1$, an equivalent
expression is


$$f_{X|Y}(x\,|\,y) = \frac{f_X(x)\,f_{Y|X}(y\,|\,x)}{\int_{-\infty}^{\infty} f_X(t)\,f_{Y|X}(y\,|\,t)\,dt}.$$

Example 3.19. A light bulb produced by the General Illumination Company is
known to have an exponentially distributed lifetime Y. However, the company has
been experiencing quality control problems. On any given day, the parameter λ of
the PDF of Y is actually a random variable, uniformly distributed in the interval
[1, 3/2]. We test a light bulb and record its lifetime. What can we say about the
underlying parameter λ?

We model the parameter λ in terms of a uniform random variable Λ with
PDF
$$f_\Lambda(\lambda) = 2, \qquad \text{for } 1 \le \lambda \le \tfrac{3}{2}.$$
The available information about Λ is captured by the conditional PDF $f_{\Lambda|Y}(\lambda\,|\,y)$,
which, using the continuous Bayes' rule, is given by
$$f_{\Lambda|Y}(\lambda\,|\,y) = \frac{f_\Lambda(\lambda)\,f_{Y|\Lambda}(y\,|\,\lambda)}{\int_{-\infty}^{\infty} f_\Lambda(t)\,f_{Y|\Lambda}(y\,|\,t)\,dt} = \frac{2\lambda e^{-\lambda y}}{\int_{1}^{3/2} 2t e^{-ty}\,dt}, \qquad \text{for } 1 \le \lambda \le \tfrac{3}{2}.$$
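As a rough numerical companion to Example 3.19, the following sketch evaluates the posterior PDF of the parameter by approximating the denominator integral with a midpoint rule. The names `integrate` and `posterior_pdf`, the constants `LOW` and `HIGH`, and the choice of the midpoint rule are illustrative assumptions, not part of the text.

```python
import math

LOW, HIGH = 1.0, 1.5  # support of the uniform prior on the rate parameter


def integrate(func, a, b, steps=2000):
    """Midpoint-rule approximation of the integral of func over [a, b]."""
    width = (b - a) / steps
    return width * sum(func(a + (k + 0.5) * width) for k in range(steps))


def posterior_pdf(lam, y):
    """Posterior density of the rate, given an observed lifetime y >= 0."""
    if not LOW <= lam <= HIGH:
        return 0.0
    # Prior density (equal to 2) times the exponential likelihood.
    numerator = 2.0 * lam * math.exp(-lam * y)
    denominator = integrate(lambda t: 2.0 * t * math.exp(-t * y), LOW, HIGH)
    return numerator / denominator
```

For a short observed lifetime (small y) the posterior puts more weight on smaller values of the rate only when y exceeds 1/λ; for y well below 1 it increases with λ over the interval, which the sketch can be used to explore.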


Inference about a Discrete Random Variable 


In some cases, the unobserved phenomenon is inherently discrete. For some 
examples, consider a binary signal which is observed in the presence of normally 
distributed noise, or a medical diagnosis that is made on the basis of continuous 
measurements such as temperature and blood counts. In such cases, a somewhat 
different version of Bayes’ rule applies. 

We first consider the case where the unobserved phenomenon is described in
terms of an event A whose occurrence is unknown. Let P(A) be the probability of
event A. Let Y be a continuous random variable, and assume that the conditional
PDFs $f_{Y|A}(y)$ and $f_{Y|A^c}(y)$ are known. We are interested in the conditional
probability P(A | Y = y) of the event A, given the value y of Y.

Instead of working with the conditioning event {Y = y}, which has zero
probability, let us instead condition on the event {y ≤ Y ≤ y + δ}, where δ is
a small positive number, and then take the limit as δ tends to zero. We have,
using Bayes' rule, and assuming that $f_Y(y) > 0$,
$$\begin{aligned}
P(A \,|\, Y = y) &\approx P(A \,|\, y \le Y \le y + \delta) \\
&= \frac{P(A)\,P(y \le Y \le y + \delta \,|\, A)}{P(y \le Y \le y + \delta)} \\
&\approx \frac{P(A)\,f_{Y|A}(y)\,\delta}{f_Y(y)\,\delta} \\
&= \frac{P(A)\,f_{Y|A}(y)}{f_Y(y)}.
\end{aligned}$$


The denominator can be evaluated using the following version of the total
probability theorem:
$$f_Y(y) = P(A)\,f_{Y|A}(y) + P(A^c)\,f_{Y|A^c}(y),$$
so that
$$P(A \,|\, Y = y) = \frac{P(A)\,f_{Y|A}(y)}{P(A)\,f_{Y|A}(y) + P(A^c)\,f_{Y|A^c}(y)}.$$


In a variant of this formula, we consider an event A of the form {N = n},
where N is a discrete random variable that represents the different discrete
possibilities for the unobserved phenomenon of interest. Let $p_N$ be the PMF
of N. Let also Y be a continuous random variable which, for any given value n
of N, is described by a conditional PDF $f_{Y|N}(y\,|\,n)$. The above formula becomes
$$P(N = n \,|\, Y = y) = \frac{p_N(n)\,f_{Y|N}(y\,|\,n)}{f_Y(y)}.$$
The denominator can be evaluated using the following version of the total
probability theorem:
$$f_Y(y) = \sum_i p_N(i)\,f_{Y|N}(y\,|\,i),$$
so that
$$P(N = n \,|\, Y = y) = \frac{p_N(n)\,f_{Y|N}(y\,|\,n)}{\sum_i p_N(i)\,f_{Y|N}(y\,|\,i)}.$$

Example 3.20. Signal Detection. A binary signal S is transmitted, and we are
given that P(S = 1) = p and P(S = −1) = 1 − p. The received signal is Y = N + S,
where N is normal noise, with zero mean and unit variance, independent of S. What
is the probability that S = 1, as a function of the observed value y of Y?

Conditioned on S = s, the random variable Y has a normal distribution with
mean s and unit variance. Applying the formulas given above, we obtain
$$P(S = 1 \,|\, Y = y) = \frac{p_S(1)\,f_{Y|S}(y\,|\,1)}{f_Y(y)} = \frac{\dfrac{p}{\sqrt{2\pi}}\,e^{-(y-1)^2/2}}{\dfrac{p}{\sqrt{2\pi}}\,e^{-(y-1)^2/2} + \dfrac{1-p}{\sqrt{2\pi}}\,e^{-(y+1)^2/2}},$$
which simplifies to
$$P(S = 1 \,|\, Y = y) = \frac{p\,e^{y}}{p\,e^{y} + (1-p)\,e^{-y}}.$$

Note that the probability P(S = 1 | Y = y) goes to zero as y decreases to −∞,
goes to 1 as y increases to ∞, and is monotonically increasing in between, which is
consistent with intuition.
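The signal detection formula can be checked numerically. The sketch below, with hypothetical function names, computes P(S = 1 | Y = y) once directly from Bayes' rule with normal conditional PDFs and once from the simplified expression, so the two can be compared.

```python
import math


def prob_signal_is_one(y, p):
    """P(S = 1 | Y = y) from Bayes' rule, with unit-variance normal noise."""
    def normal_pdf(value, mean):
        return math.exp(-(value - mean) ** 2 / 2.0) / math.sqrt(2.0 * math.pi)

    numerator = p * normal_pdf(y, 1.0)
    denominator = numerator + (1.0 - p) * normal_pdf(y, -1.0)
    return numerator / denominator


def prob_signal_is_one_simplified(y, p):
    """Simplified form p e^y / (p e^y + (1 - p) e^{-y})."""
    return p * math.exp(y) / (p * math.exp(y) + (1.0 - p) * math.exp(-y))
```

At y = 0 the observation is equally consistent with both signal values, so the posterior probability reduces to the prior p.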
Inference Based on Discrete Observations


We finally note that our earlier formula expressing P(A | Y = y) in terms of
$f_{Y|A}(y)$ can be turned around to yield
$$f_{Y|A}(y) = \frac{f_Y(y)\,P(A \,|\, Y = y)}{P(A)}.$$
Based on the normalization property $\int_{-\infty}^{\infty} f_{Y|A}(y)\,dy = 1$, an equivalent
expression is
$$f_{Y|A}(y) = \frac{f_Y(y)\,P(A \,|\, Y = y)}{\int_{-\infty}^{\infty} f_Y(t)\,P(A \,|\, Y = t)\,dt}.$$
This formula can be used to make an inference about a random variable Y when
an event A is observed. There is a similar formula for the case where the event
A is of the form {N = n}, where N is an observed discrete random variable that
depends on Y in a manner described by a conditional PMF $p_{N|Y}(n\,|\,y)$.





Bayes’ Rule for Continuous Random Variables 


Let Y be a continuous random variable. 


• If X is a continuous random variable, we have
$$f_Y(y)\,f_{X|Y}(x\,|\,y) = f_X(x)\,f_{Y|X}(y\,|\,x),$$
and
$$f_{X|Y}(x\,|\,y) = \frac{f_X(x)\,f_{Y|X}(y\,|\,x)}{f_Y(y)} = \frac{f_X(x)\,f_{Y|X}(y\,|\,x)}{\int_{-\infty}^{\infty} f_X(t)\,f_{Y|X}(y\,|\,t)\,dt}.$$

• If N is a discrete random variable, we have
$$f_Y(y)\,P(N = n \,|\, Y = y) = p_N(n)\,f_{Y|N}(y\,|\,n),$$
resulting in the formulas
$$P(N = n \,|\, Y = y) = \frac{p_N(n)\,f_{Y|N}(y\,|\,n)}{f_Y(y)} = \frac{p_N(n)\,f_{Y|N}(y\,|\,n)}{\sum_i p_N(i)\,f_{Y|N}(y\,|\,i)},$$
and
$$f_{Y|N}(y\,|\,n) = \frac{f_Y(y)\,P(N = n \,|\, Y = y)}{p_N(n)} = \frac{f_Y(y)\,P(N = n \,|\, Y = y)}{\int_{-\infty}^{\infty} f_Y(t)\,P(N = n \,|\, Y = t)\,dt}.$$

• There are similar formulas for P(A | Y = y) and $f_{Y|A}(y)$.





3.7 SUMMARY AND DISCUSSION 


Continuous random variables are characterized by PDFs, which are used to cal- 
culate event probabilities. This is similar to the use of PMFs for the discrete case, 
except that now we need to integrate instead of summing. Joint PDFs are similar 
to joint PMFs and are used to determine the probability of events that are de- 
fined in terms of multiple random variables. Furthermore, conditional PDFs are 
similar to conditional PMFs and are used to calculate conditional probabilities, 
given the value of the conditioning random variable. An important application is 
in problems of inference, using various forms of Bayes’ rule that were developed 
in this chapter. 

There are several special continuous random variables which frequently 
arise in probabilistic models. We introduced some of them, and derived their 
mean and variance. A summary is provided in the table that follows. 





Summary of Results for Special Random Variables 


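Continuous Uniform Over [a, b]:
$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise,} \end{cases}$$
$$E[X] = \frac{a+b}{2}, \qquad \mathrm{var}(X) = \frac{(b-a)^2}{12}.$$

Exponential with Parameter λ:
$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases} \qquad F_X(x) = \begin{cases} 1 - e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases}$$
$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}.$$

Normal with Parameters μ and σ² > 0:
$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)},$$
$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2.$$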








We have also introduced CDFs, which can be used to characterize general 
random variables that are neither discrete nor continuous. CDFs are related to 
PMFs and PDFs, but are more general. For a discrete random variable, we can 
obtain the PMF by differencing the CDF; for a continuous random variable, we 
can obtain the PDF by differentiating the CDF.

PROBLEMS


SECTION 3.1. Continuous Random Variables and PDFs 


Problem 1. Let X be uniformly distributed in the unit interval [0, 1]. Consider the
random variable Y = g(X), where
$$g(x) = \begin{cases} 1, & \text{if } x \le 1/3, \\ 2, & \text{if } x > 1/3. \end{cases}$$


Find the expected value of Y by first deriving its PMF. Verify the result using the 
expected value rule. 


Problem 2. Laplace random variable. Let X have the PDF
$$f_X(x) = \frac{\lambda}{2}\,e^{-\lambda |x|},$$
where λ is a positive scalar. Verify that $f_X$ satisfies the normalization condition, and
evaluate the mean and variance of X.


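Problem 3.* Show that the expected value of a discrete or continuous random
variable X satisfies
$$E[X] = \int_0^\infty P(X > x)\,dx - \int_0^\infty P(X < -x)\,dx.$$

Solution. Suppose that X is continuous. We then have
$$\begin{aligned}
\int_0^\infty P(X > x)\,dx &= \int_0^\infty \left( \int_x^\infty f_X(y)\,dy \right) dx \\
&= \int_0^\infty \left( \int_0^y f_X(y)\,dx \right) dy \\
&= \int_0^\infty f_X(y) \left( \int_0^y dx \right) dy \\
&= \int_0^\infty y f_X(y)\,dy,
\end{aligned}$$
where for the second equality we have reversed the order of integration by writing the
set {(x, y) | 0 ≤ x < ∞, x ≤ y < ∞} as {(x, y) | 0 ≤ x ≤ y, 0 ≤ y < ∞}. Similarly, we
can show that
$$\int_0^\infty P(X < -x)\,dx = -\int_{-\infty}^0 y f_X(y)\,dy.$$
Combining the two relations above, we obtain the desired result.
If X is discrete, we have
$$\begin{aligned}
\int_0^\infty P(X > x)\,dx &= \int_0^\infty \sum_{y > x} p_X(y)\,dx \\
&= \sum_{y > 0} \left( \int_0^y p_X(y)\,dx \right) \\
&= \sum_{y > 0} p_X(y) \left( \int_0^y dx \right) \\
&= \sum_{y > 0} p_X(y)\,y,
\end{aligned}$$
and the rest of the argument is similar to the continuous case.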


Problem 4.* Establish the validity of the expected value rule
$$E[g(X)] = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx,$$
where X is a continuous random variable with PDF $f_X$.
Solution. Let us express the function g as the difference of two nonnegative functions,
$$g(x) = g^+(x) - g^-(x),$$
where $g^+(x) = \max\{g(x), 0\}$, and $g^-(x) = \max\{-g(x), 0\}$. In particular, for any t ≥ 0,
we have g(x) > t if and only if $g^+(x) > t$.
We will use the result
$$E[g(X)] = \int_0^\infty P(g(X) > t)\,dt - \int_0^\infty P(g(X) < -t)\,dt$$
from the preceding problem. The first term in the right-hand side is equal to
$$\int_0^\infty \int_{\{x \,|\, g(x) > t\}} f_X(x)\,dx\,dt = \int_{-\infty}^{\infty} \int_{\{t \,|\, 0 \le t < g(x)\}} f_X(x)\,dt\,dx = \int_{-\infty}^{\infty} g^+(x)\,f_X(x)\,dx.$$
By a symmetrical argument, the second term in the right-hand side is given by
$$\int_0^\infty P(g(X) < -t)\,dt = \int_{-\infty}^{\infty} g^-(x)\,f_X(x)\,dx.$$
Combining the above equalities, we obtain
$$E[g(X)] = \int_{-\infty}^{\infty} g^+(x)\,f_X(x)\,dx - \int_{-\infty}^{\infty} g^-(x)\,f_X(x)\,dx = \int_{-\infty}^{\infty} g(x)\,f_X(x)\,dx.$$


SECTION 3.2. Cumulative Distribution Functions


Problem 5. Consider a triangle and a point chosen within the triangle according to 
the uniform probability law. Let X be the distance from the point to the base of the 
triangle. Given the height of the triangle, find the CDF and the PDF of X. 


Problem 6. Calamity Jane goes to the bank to make a withdrawal, and is equally
likely to find 0 or 1 customers ahead of her. The service time of the customer ahead,
if present, is exponentially distributed with parameter λ. What is the CDF of Jane's
waiting time?


Problem 7. Alvin throws darts at a circular target of radius r and is equally likely
to hit any point in the target. Let X be the distance of Alvin's hit from the center.
(a) Find the PDF, the mean, and the variance of X.


(b) The target has an inner circle of radius t. If X ≤ t, Alvin gets a score of S = 1/X.
Otherwise his score is S = 0. Find the CDF of S. Is S a continuous random
variable?


Problem 8. Consider two continuous random variables Y and Z, and a random 
variable X that is equal to Y with probability p and to Z with probability 1 — p. 
(a) Show that the PDF of X is given by 


fx(z) = pfy (z) + (1 — p)fz(z). 


(b) Calculate the CDF of the two-sided exponential random variable that has PDF
given by
$$f_X(x) = \begin{cases} p\lambda e^{\lambda x}, & \text{if } x < 0, \\ (1-p)\lambda e^{-\lambda x}, & \text{if } x \ge 0, \end{cases}$$
where λ > 0 and 0 < p < 1.
Problem 9.* Mixed random variables. Probabilistic models sometimes involve
random variables that can be viewed as a mixture of a discrete random variable Y and
a continuous random variable Z. By this we mean that the value of X is obtained
according to the probability law of Y with a given probability p, and according to the
probability law of Z with the complementary probability 1 − p. Then, X is called a
mixed random variable and its CDF is given, using the total probability theorem, by
$$\begin{aligned}
F_X(x) &= P(X \le x) \\
&= p\,P(Y \le x) + (1-p)\,P(Z \le x) \\
&= p\,F_Y(x) + (1-p)\,F_Z(x).
\end{aligned}$$
Its expected value is defined in a way that conforms to the total expectation theorem:
$$E[X] = p\,E[Y] + (1-p)\,E[Z].$$

The taxi stand and the bus stop near Al's home are in the same location. Al goes
there at a given time and if a taxi is waiting (this happens with probability 2/3) he
boards it. Otherwise he waits for a taxi or a bus to come, whichever comes first. The
next taxi will arrive in a time that is uniformly distributed between 0 and 10 minutes,
while the next bus will arrive in exactly 5 minutes. Find the CDF and the expected
value of Al's waiting time.


Solution. Let A be the event that Al will find a taxi waiting or will be picked up by
the bus after 5 minutes. Note that the probability of boarding the next bus, given that
Al has to wait, is
$$P(\text{a taxi will take more than 5 minutes to arrive}) = \frac{1}{2}.$$
Al's waiting time, call it X, is a mixed random variable. With probability
$$P(A) = \frac{2}{3} + \frac{1}{3} \cdot \frac{1}{2} = \frac{5}{6},$$
it is equal to its discrete component Y (corresponding to either finding a taxi waiting,
or boarding the bus), which has PMF
$$p_Y(y) = \begin{cases} \dfrac{2}{3P(A)}, & \text{if } y = 0, \\[2mm] \dfrac{1}{6P(A)}, & \text{if } y = 5, \end{cases} \;=\; \begin{cases} \dfrac{12}{15}, & \text{if } y = 0, \\[2mm] \dfrac{3}{15}, & \text{if } y = 5. \end{cases}$$
[This equation follows from the calculation
$$p_Y(0) = P(Y = 0 \,|\, A) = \frac{P(Y = 0, A)}{P(A)} = \frac{2}{3P(A)}.$$
The calculation for $p_Y(5)$ is similar.] With the complementary probability 1 − P(A),
the waiting time is equal to its continuous component Z (corresponding to boarding a
taxi after having to wait for some time less than 5 minutes), which has PDF
$$f_Z(z) = \begin{cases} 1/5, & \text{if } 0 \le z \le 5, \\ 0, & \text{otherwise.} \end{cases}$$
The CDF is given by $F_X(x) = P(A)F_Y(x) + (1 - P(A))F_Z(x)$, from which
$$F_X(x) = \begin{cases} 0, & \text{if } x < 0, \\[1mm] \dfrac{5}{6} \cdot \dfrac{12}{15} + \dfrac{1}{6} \cdot \dfrac{x}{5}, & \text{if } 0 \le x < 5, \\[1mm] 1, & \text{if } 5 \le x. \end{cases}$$

The expected value of the waiting time is

$$E[X] = P(A)E[Y] + (1 - P(A))E[Z] = \frac{5}{6} \cdot \frac{3}{15} \cdot 5 + \frac{1}{6} \cdot \frac{5}{2} = \frac{15}{12}.$$


Problem 10.* Simulating a continuous random variable. A computer has a
subroutine that can generate values of a random variable U that is uniformly distributed
in the interval [0, 1]. Such a subroutine can be used to generate values of a continuous
random variable with given CDF F(x) as follows. If U takes a value u, we let the value
of X be a number x that satisfies F(x) = u. For simplicity, we assume that the given
CDF is strictly increasing over the range S of values of interest, where S = {x | 0 <
F(x) < 1}. This condition guarantees that for any u ∈ (0, 1), there is a unique x that
satisfies F(x) = u.


(a) Show that the CDF of the random variable X thus generated is indeed equal to 
the given CDF. 


(b) Describe how this procedure can be used to simulate an exponential random
variable with parameter λ.


(c) How can this procedure be generalized to simulate a discrete integer-valued ran- 
dom variable? 


Solution. (a) By definition, the random variables X and U satisfy the relation F(X) =
U. Since F is strictly increasing, we have for every x,
$$X \le x \quad \text{if and only if} \quad F(X) \le F(x).$$
Therefore,
$$P(X \le x) = P\big(F(X) \le F(x)\big) = P\big(U \le F(x)\big) = F(x),$$
where the last equality follows because U is uniform. Thus, X has the desired CDF.


(b) The exponential CDF has the form $F(x) = 1 - e^{-\lambda x}$ for x ≥ 0. Thus, to generate
values of X, we should generate values u ∈ (0, 1) of a uniformly distributed random
variable U, and set X to the value for which $1 - e^{-\lambda x} = u$, or $x = -\ln(1 - u)/\lambda$.


(c) Let again F be the desired CDF. To any u ∈ (0, 1), there corresponds a unique
integer $x_u$ such that $F(x_u - 1) < u \le F(x_u)$. This correspondence defines a random
variable X as a function of the random variable U. We then have, for every integer k,
$$P(X = k) = P\big(F(k-1) < U \le F(k)\big) = F(k) - F(k-1).$$
Therefore, the CDF of X is equal to F, as desired.
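Parts (b) and (c) of this solution can be sketched in a few lines of code. The function names and the `start` argument (the smallest integer the discrete variable can take) are illustrative assumptions; the discrete search simply walks upward until the CDF reaches u, which presumes that the CDF eventually exceeds u.

```python
import math


def sample_exponential(lam, u):
    """Map a uniform value u in [0, 1) to x with 1 - exp(-lam * x) = u."""
    return -math.log(1.0 - u) / lam


def sample_discrete(cdf, u, start=0):
    """Return the smallest integer k >= start with cdf(k) >= u.

    When cdf(start - 1) < u, this is the unique integer k satisfying
    cdf(k - 1) < u <= cdf(k).
    """
    k = start
    while cdf(k) < u:
        k += 1
    return k
```

Feeding either function values of u drawn from a uniform generator (for example `random.random()`) yields samples with the desired CDF, under the assumptions stated in the problem.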


SECTION 3.3. Normal Random Variables 
Problem 11. Let X and Y be normal random variables with means 0 and 1, respectively,
and variances 1 and 4, respectively.
(a) Find P(X ≤ 1.5) and P(X ≤ −1).
(b) Find the PDF of (Y − 1)/2.
(c) Find P(−1 ≤ Y ≤ 1).
Problem 12. Let X be a normal random variable with zero mean and standard
deviation σ. Use the normal tables to compute the probabilities of the events {X ≥ kσ}
and {|X| ≤ kσ} for k = 1, 2, 3.
Problem 13. A city's temperature is modeled as a normal random variable with mean
and standard deviation both equal to 10 degrees Celsius. What is the probability that
the temperature at a randomly chosen time will be less than or equal to 59 degrees
Fahrenheit?


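Problem 14.* Show that the normal PDF satisfies the normalization property. Hint:
The integral $\int_{-\infty}^{\infty} e^{-x^2/2}\,dx$ is equal to the square root of
$$\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-x^2/2}\,e^{-y^2/2}\,dx\,dy,$$
and the latter integral can be evaluated by transforming to polar coordinates.


Solution. We note that
$$\begin{aligned}
\left( \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx \right)^{2}
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-x^2/2}\,dx \int_{-\infty}^{\infty} e^{-y^2/2}\,dy \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)/2}\,dx\,dy \\
&= \frac{1}{2\pi} \int_{0}^{2\pi} \int_{0}^{\infty} e^{-r^2/2}\,r\,dr\,d\theta \\
&= \int_{0}^{\infty} e^{-r^2/2}\,r\,dr \\
&= \int_{0}^{\infty} e^{-u}\,du \\
&= -e^{-u}\,\Big|_{0}^{\infty} \\
&= 1,
\end{aligned}$$
where for the third equality, we use a transformation into polar coordinates, and for
the fifth equality, we use the change of variables u = r²/2. Thus, we have
$$\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx = 1,$$
because the integral is positive. Using the change of variables u = (x − μ)/σ, it follows
that
$$\int_{-\infty}^{\infty} f_X(x)\,dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma}\,e^{-(x-\mu)^2/(2\sigma^2)}\,dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\,e^{-u^2/2}\,du = 1.$$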
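The normalization property can also be checked numerically, as a sanity test rather than a proof. The sketch below integrates the normal PDF with a midpoint rule over a window of ten standard deviations on each side of the mean; the function names and the window size are illustrative assumptions.

```python
import math


def normal_pdf(x, mu, sigma):
    """Normal PDF with mean mu and standard deviation sigma > 0."""
    return math.exp(-(x - mu) ** 2 / (2.0 * sigma ** 2)) / (math.sqrt(2.0 * math.pi) * sigma)


def total_probability(mu, sigma, half_width=10.0, steps=20_000):
    """Midpoint-rule integral of the PDF over mu +/- half_width * sigma."""
    a = mu - half_width * sigma
    b = mu + half_width * sigma
    width = (b - a) / steps
    return width * sum(normal_pdf(a + (k + 0.5) * width, mu, sigma) for k in range(steps))
```

The result should be very close to 1 for any choice of the mean and any positive standard deviation, consistent with the change of variables used in the solution.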


SECTION 3.4. Joint PDFs of Multiple Random Variables 


Problem 15. A point is chosen at random (according to a uniform PDF) within a
semicircle of the form {(x, y) | x² + y² ≤ r², y ≥ 0}, for some given r > 0.


(a) Find the joint PDF of the coordinates X and Y of the chosen point.


(b) Find the marginal PDF of Y and use it to find E[Y].


(c) Check your answer in (b) by computing E[Y] directly without using the marginal
PDF of Y.


Problem 16. Consider the following variant of Buffon's needle problem (Example
3.11), which was investigated by Laplace. A needle of length l is dropped on a plane
surface that is partitioned in rectangles by horizontal lines that are a apart and vertical
lines that are b apart. Suppose that the needle's length l satisfies l < a and l < b. What
is the expected number of rectangle sides crossed by the needle? What is the probability
that the needle will cross at least one side of some rectangle?


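Problem 17.* Estimating an expected value by simulation using samples of
another random variable. Let $Y_1, \ldots, Y_n$ be independent random variables drawn
from a common and known PDF $f_Y$. Let S be the set of all possible values of $Y_i$,
S = {y | $f_Y(y) > 0$}. Let X be a random variable with known PDF $f_X$, such that
$f_X(y) = 0$, for all y ∉ S. Consider the random variable
$$Z = \frac{1}{n} \sum_{i=1}^{n} Y_i\,\frac{f_X(Y_i)}{f_Y(Y_i)}.$$
Show that
$$E[Z] = E[X].$$

Solution. We have
$$E\left[ Y_i\,\frac{f_X(Y_i)}{f_Y(Y_i)} \right] = \int_S y\,\frac{f_X(y)}{f_Y(y)}\,f_Y(y)\,dy = \int_S y\,f_X(y)\,dy = E[X].$$
Thus,
$$E[Z] = \frac{1}{n} \sum_{i=1}^{n} E\left[ Y_i\,\frac{f_X(Y_i)}{f_Y(Y_i)} \right] = \frac{1}{n}\,n\,E[X] = E[X].$$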
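A small simulation can illustrate the estimator Z of this problem. In the sketch below the function name `estimate_mean` and its arguments are hypothetical; the caller supplies a sampler for Y and the two PDFs, and is responsible for ensuring that the PDF of Y is positive wherever samples can fall.

```python
import random


def estimate_mean(sample_y, pdf_y, pdf_x, n, rng):
    """Average of Y_i * f_X(Y_i) / f_Y(Y_i) over n independent draws of Y."""
    total = 0.0
    for _ in range(n):
        y = sample_y(rng)
        total += y * pdf_x(y) / pdf_y(y)
    return total / n


# Illustration: Y uniform on [0, 1), X with PDF 2x on [0, 1], so E[X] = 2/3.
rng = random.Random(0)
estimate = estimate_mean(lambda r: r.random(), lambda y: 1.0, lambda y: 2.0 * y, 100_000, rng)
```

With this many samples the estimate should typically land within a few thousandths of 2/3, although any single run is random.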
SECTION 3.5. Conditioning 


Problem 18. Let X be a random variable with PDF
$$f_X(x) = \begin{cases} x/4, & \text{if } 1 < x \le 3, \\ 0, & \text{otherwise,} \end{cases}$$
and let A be the event {X ≥ 2}.
(a) Find E[X], P(A), $f_{X|A}(x)$, and E[X | A].
(b) Let Y = X². Find E[Y] and var(Y).


Problem 19. The random variable X has the PDF
$$f_X(x) = \begin{cases} c\,x^{-2}, & \text{if } 1 \le x \le 2, \\ 0, & \text{otherwise.} \end{cases}$$
(a) Determine the value of c.


(b) Let A be the event {X > 1.5}. Calculate P(A) and the conditional PDF of X
given that A has occurred.


(c) Let Y = X². Calculate the conditional expectation and the conditional variance
of Y given A.


Problem 20. An absent-minded professor schedules two student appointments for the 
same time. The appointment durations are independent and exponentially distributed 
with mean thirty minutes. The first student arrives on time, but the second student 
arrives five minutes late. What is the expected value of the time between the arrival 
of the first student and the departure of the second student? 


Problem 21. We start with a stick of length ℓ. We break it at a point which is chosen
according to a uniform distribution and keep the piece, of length Y, that contains the
left end of the stick. We then repeat the same process on the piece that we were left
with, and let X be the length of the remaining piece after breaking for the second time.


(a) Find the joint PDF of Y and X.

(b) Find the marginal PDF of X.

(c) Use the PDF of X to evaluate E[X].

(d) Evaluate E[X], by exploiting the relation X = Y · (X/Y).

Problem 22. We have a stick of unit length, and we consider breaking it in three 
pieces using one of the following three methods. 


(i) We choose randomly and independently two points on the stick using a uniform 
PDF, and we break the stick at these two points. 


(ii) We break the stick at a random point chosen by using a uniform PDF, and then
we break the piece that contains the right end of the stick, at a random point
chosen by using a uniform PDF.


(iii) We break the stick at a random point chosen by using a uniform PDF, and then 
we break the larger of the two pieces at a random point chosen by using a uniform 
PDF. 


For each of the methods (i), (ii), and (iii), what is the probability that the three pieces 
we are left with can form a triangle? 
Problem 23. Let the random variables X and Y have a joint PDF which is uniform
over the triangle with vertices at (0, 0), (0, 1), and (1, 0).
(a) Find the joint PDF of X and Y.
(b) Find the marginal PDF of Y.
(c) Find the conditional PDF of X given Y.


(d) Find E[X | Y = y], and use the total expectation theorem to find E[X] in terms 
of E[Y]. 


(e) Use the symmetry of the problem to find the value of E[X].


Problem 24. Let X and Y be two random variables that are uniformly distributed
over the triangle formed by the points (0, 0), (1, 0), and (0, 2) (this is an asymmetric
version of the PDF in the previous problem). Calculate E[X] and E[Y] by following
the same steps as in the previous problem.


Problem 25. The coordinates X and Y of a point are independent zero mean normal
random variables with common variance σ². Given that the point is at a distance of
at least c from the origin, find the conditional joint PDF of X and Y.


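Problem 26.* Let $X_1, \ldots, X_n$ be independent random variables. Show that
$$\frac{\mathrm{var}\left( \prod_{i=1}^{n} X_i \right)}{\prod_{i=1}^{n} \big(E[X_i]\big)^2} = \prod_{i=1}^{n} \left( \frac{\mathrm{var}(X_i)}{\big(E[X_i]\big)^2} + 1 \right) - 1.$$

Solution. We have
$$\begin{aligned}
\mathrm{var}\left( \prod_{i=1}^{n} X_i \right) &= E\left[ \prod_{i=1}^{n} X_i^2 \right] - \prod_{i=1}^{n} \big(E[X_i]\big)^2 \\
&= \prod_{i=1}^{n} E[X_i^2] - \prod_{i=1}^{n} \big(E[X_i]\big)^2 \\
&= \prod_{i=1}^{n} \Big( \mathrm{var}(X_i) + \big(E[X_i]\big)^2 \Big) - \prod_{i=1}^{n} \big(E[X_i]\big)^2.
\end{aligned}$$
The desired result follows by dividing both sides by
$$\prod_{i=1}^{n} \big(E[X_i]\big)^2.$$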


Problem 27.* Conditioning multiple random variables on events. Let X
and Y be continuous random variables with joint PDF $f_{X,Y}$, let A be a subset of the
two-dimensional plane, and let C = {(X, Y) ∈ A}. Assume that P(C) > 0, and define
$$f_{X,Y|C}(x,y) = \begin{cases} \dfrac{f_{X,Y}(x,y)}{P(C)}, & \text{if } (x,y) \in A, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Show that $f_{X,Y|C}$ is a legitimate joint PDF.


(b) Consider a partition of the two-dimensional plane into disjoint subsets $A_i$, i =
1, ..., n, let $C_i$ = {(X, Y) ∈ $A_i$}, and assume that P($C_i$) > 0 for all i. Derive
the following version of the total probability theorem


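$$f_{X,Y}(x,y) = \sum_{i=1}^{n} P(C_i)\,f_{X,Y|C_i}(x,y).$$


Problem 28.* Consider the following two-sided exponential PDF
$$f_X(x) = \begin{cases} p\lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ (1-p)\lambda e^{\lambda x}, & \text{if } x < 0, \end{cases}$$
where λ and p are scalars with λ > 0 and p ∈ [0, 1]. Find the mean and the variance
of X in two ways:


(a) By straightforward calculation of the associated expected values.


(b) By using a divide-and-conquer strategy, and the mean and variance of the
(one-sided) exponential random variable.


Solution. (a)
$$\begin{aligned}
E[X] &= \int_{-\infty}^{\infty} x f_X(x)\,dx \\
&= \int_{-\infty}^{0} x(1-p)\lambda e^{\lambda x}\,dx + \int_{0}^{\infty} x p \lambda e^{-\lambda x}\,dx \\
&= -\frac{1-p}{\lambda} + \frac{p}{\lambda} \\
&= \frac{2p-1}{\lambda},
\end{aligned}$$
$$\begin{aligned}
E[X^2] &= \int_{-\infty}^{\infty} x^2 f_X(x)\,dx \\
&= \int_{-\infty}^{0} x^2(1-p)\lambda e^{\lambda x}\,dx + \int_{0}^{\infty} x^2 p \lambda e^{-\lambda x}\,dx \\
&= \frac{2(1-p)}{\lambda^2} + \frac{2p}{\lambda^2} \\
&= \frac{2}{\lambda^2},
\end{aligned}$$
and
$$\mathrm{var}(X) = \frac{2}{\lambda^2} - \left( \frac{2p-1}{\lambda} \right)^2.$$

(b) Let A be the event {X ≥ 0}, and note that P(A) = p. Conditioned on A, the
random variable X has a (one-sided) exponential distribution with parameter λ. Also,
conditioned on $A^c$, the random variable −X has the same one-sided exponential
distribution. Thus,
$$E[X \,|\, A] = \frac{1}{\lambda}, \qquad E[X \,|\, A^c] = -\frac{1}{\lambda},$$
and
$$E[X^2 \,|\, A] = E[X^2 \,|\, A^c] = \frac{2}{\lambda^2}.$$
It follows that
$$\begin{aligned}
E[X] &= P(A)E[X \,|\, A] + P(A^c)E[X \,|\, A^c] \\
&= \frac{p}{\lambda} - \frac{1-p}{\lambda} \\
&= \frac{2p-1}{\lambda},
\end{aligned}$$
$$\begin{aligned}
E[X^2] &= P(A)E[X^2 \,|\, A] + P(A^c)E[X^2 \,|\, A^c] \\
&= \frac{2p}{\lambda^2} + \frac{2(1-p)}{\lambda^2} \\
&= \frac{2}{\lambda^2},
\end{aligned}$$
and
$$\mathrm{var}(X) = \frac{2}{\lambda^2} - \left( \frac{2p-1}{\lambda} \right)^2.$$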


Problem 29.* Let X, Y, and Z be three random variables with joint PDF $f_{X,Y,Z}$.
Show the multiplication rule:
$$f_{X,Y,Z}(x,y,z) = f_{X|Y,Z}(x\,|\,y,z)\,f_{Y|Z}(y\,|\,z)\,f_Z(z).$$

Solution. We have, using the definition of conditional density,
$$f_{X|Y,Z}(x\,|\,y,z) = \frac{f_{X,Y,Z}(x,y,z)}{f_{Y,Z}(y,z)},$$
and
$$f_{Y,Z}(y,z) = f_{Y|Z}(y\,|\,z)\,f_Z(z).$$
Combining these two relations, we obtain the multiplication rule.


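Problem 30.* The Beta PDF. The beta PDF with parameters α > 0 and β > 0
has the form
$$f_X(x) = \begin{cases} \dfrac{1}{B(\alpha,\beta)}\,x^{\alpha-1}(1-x)^{\beta-1}, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
The normalizing constant is
$$B(\alpha,\beta) = \int_0^1 x^{\alpha-1}(1-x)^{\beta-1}\,dx,$$
and is known as the Beta function.


(a) Show that for any m > 0, the mth moment of X is given by
$$E[X^m] = \frac{B(\alpha+m,\beta)}{B(\alpha,\beta)}.$$

(b) Assume that α and β are integer. Show that
$$B(\alpha,\beta) = \frac{(\alpha-1)!\,(\beta-1)!}{(\alpha+\beta-1)!},$$
so that
$$E[X^m] = \frac{\alpha(\alpha+1)\cdots(\alpha+m-1)}{(\alpha+\beta)(\alpha+\beta+1)\cdots(\alpha+\beta+m-1)}.$$
(Recall here the convention that 0! = 1.)


Solution. (a) We have
$$E[X^m] = \frac{1}{B(\alpha,\beta)} \int_0^1 x^m x^{\alpha-1}(1-x)^{\beta-1}\,dx = \frac{B(\alpha+m,\beta)}{B(\alpha,\beta)}.$$

(b) In the special case where α = 1 or β = 1, we can carry out the straightforward
integration in the definition of B(α, β), and verify the result. We will now deal with
the general case. Let $Y, Y_1, \ldots, Y_{\alpha+\beta}$ be independent random variables, uniformly
distributed over the interval [0, 1], and let A be the event
$$A = \{Y_1 \le \cdots \le Y_\alpha \le Y \le Y_{\alpha+1} \le \cdots \le Y_{\alpha+\beta}\}.$$
Then,
$$P(A) = \frac{1}{(\alpha+\beta+1)!},$$
because all ways of ordering these α + β + 1 random variables are equally likely.
Consider the following two events:
$$B = \big\{ \max\{Y_1, \ldots, Y_\alpha\} \le Y \big\}, \qquad C = \big\{ Y \le \min\{Y_{\alpha+1}, \ldots, Y_{\alpha+\beta}\} \big\}.$$
We have, using the total probability theorem,
$$\begin{aligned}
P(B \cap C) &= \int_0^1 P(B \cap C \,|\, Y = y)\,f_Y(y)\,dy \\
&= \int_0^1 P\big( \max\{Y_1, \ldots, Y_\alpha\} \le y \le \min\{Y_{\alpha+1}, \ldots, Y_{\alpha+\beta}\} \big)\,dy \\
&= \int_0^1 P\big( \max\{Y_1, \ldots, Y_\alpha\} \le y \big)\,P\big( y \le \min\{Y_{\alpha+1}, \ldots, Y_{\alpha+\beta}\} \big)\,dy \\
&= \int_0^1 y^{\alpha}(1-y)^{\beta}\,dy.
\end{aligned}$$
We also have
$$P(A \,|\, B \cap C) = \frac{1}{\alpha!\,\beta!},$$
because given the events B and C, all α! possible orderings of $Y_1, \ldots, Y_\alpha$ are equally
likely, and all β! possible orderings of $Y_{\alpha+1}, \ldots, Y_{\alpha+\beta}$ are equally likely.
By writing the equation
$$P(A) = P(B \cap C)\,P(A \,|\, B \cap C)$$
in terms of the preceding relations, we finally obtain
$$\frac{1}{(\alpha+\beta+1)!} = \frac{1}{\alpha!\,\beta!} \int_0^1 y^{\alpha}(1-y)^{\beta}\,dy,$$
or
$$\int_0^1 y^{\alpha}(1-y)^{\beta}\,dy = \frac{\alpha!\,\beta!}{(\alpha+\beta+1)!}.$$
This equation can be written as
$$B(\alpha+1, \beta+1) = \frac{\alpha!\,\beta!}{(\alpha+\beta+1)!}, \qquad \text{for all integer } \alpha > 0,\ \beta > 0.$$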

Problem 31.* Estimating an expected value by simulation. Let $f_X(x)$ be a
PDF such that for some nonnegative scalars a, b, and c, we have $f_X(x) = 0$ for all x
outside the interval [a, b], and $x f_X(x) \le c$ for all x. Let $Y_i$, i = 1, ..., n, be independent
random variables with values generated as follows: a point $(V_i, W_i)$ is chosen at random
(according to a uniform PDF) within the rectangle whose corners are (a, 0), (b, 0), (a, c),
and (b, c), and if $W_i \le V_i f_X(V_i)$, the value of $Y_i$ is set to 1, and otherwise it is set to 0.
Consider the random variable
$$Z = \frac{Y_1 + \cdots + Y_n}{n}.$$
Show that
$$E[Z] = \frac{E[X]}{c(b-a)}$$
and
$$\mathrm{var}(Z) \le \frac{1}{4n}.$$
In particular, we have var(Z) → 0 as n → ∞.


Solution. We have
$$\begin{aligned}
P(Y_i = 1) &= P\big( W_i \le V_i f_X(V_i) \big) \\
&= \int_a^b \int_0^{v f_X(v)} \frac{1}{c(b-a)}\,dw\,dv \\
&= \frac{\int_a^b v f_X(v)\,dv}{c(b-a)} \\
&= \frac{E[X]}{c(b-a)}.
\end{aligned}$$
The random variable Z has mean $P(Y_i = 1)$ and variance
$$\mathrm{var}(Z) = \frac{P(Y_i = 1)\big(1 - P(Y_i = 1)\big)}{n}.$$
Since 0 ≤ (1 − 2p)² = 1 − 4p(1 − p), we have p(1 − p) ≤ 1/4 for any p in [0, 1], so it
follows that var(Z) ≤ 1/(4n).
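The counting scheme of this problem can be sketched as a short simulation. The function name and arguments below are hypothetical; the function returns c(b − a)Z, which by the result above has expected value E[X].

```python
import random


def estimate_mean_by_counting(pdf_x, a, b, c, n, rng):
    """Return c * (b - a) * Z, where Z is the fraction of uniformly chosen
    points (v, w) in the rectangle [a, b] x [0, c] with w <= v * f_X(v)."""
    hits = 0
    for _ in range(n):
        v = rng.uniform(a, b)
        w = rng.uniform(0.0, c)
        if w <= v * pdf_x(v):
            hits += 1
    return c * (b - a) * hits / n


# Illustration: f_X(x) = 2x on [0, 1], so x * f_X(x) <= 2 and E[X] = 2/3.
rng = random.Random(1)
estimate = estimate_mean_by_counting(
    lambda x: 2.0 * x if 0.0 <= x <= 1.0 else 0.0, 0.0, 1.0, 2.0, 100_000, rng
)
```

Since var(Z) ≤ 1/(4n), the standard deviation of the rescaled estimate is at most c(b − a)/(2√n), which gives a rough guide for choosing n.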
Problem 32.* Let X and Y be continuous random variables with joint PDF $f_{X,Y}$.
Suppose that for any subsets A and B of the real line, the events {X ∈ A} and {Y ∈ B}
are independent. Show that the random variables X and Y are independent.
Solution. For any two real numbers x and y, using the independence of the events
{X ≤ x} and {Y ≤ y}, we have
$$F_{X,Y}(x,y) = P(X \le x,\ Y \le y) = P(X \le x)\,P(Y \le y) = F_X(x)\,F_Y(y).$$
Taking derivatives of both sides, we obtain
$$f_{X,Y}(x,y) = \frac{\partial^2 F_{X,Y}}{\partial x\,\partial y}(x,y) = \frac{\partial F_X}{\partial x}(x)\,\frac{\partial F_Y}{\partial y}(y) = f_X(x)\,f_Y(y),$$
which establishes that X and Y are independent.


Problem 33.* The sum of a random number of random variables. You visit
a random number N of stores and in the ith store, you spend a random amount of
money $X_i$. Let
$$T = X_1 + X_2 + \cdots + X_N$$
be the total amount of money that you spend. We assume that N is a positive integer
random variable with a given PMF, and that the $X_i$ are random variables with the
same mean E[X] and variance var(X). Furthermore, we assume that N and all the $X_i$
are independent. Show that
$$E[T] = E[X]\,E[N], \qquad \text{and} \qquad \mathrm{var}(T) = \mathrm{var}(X)\,E[N] + \big(E[X]\big)^2\,\mathrm{var}(N).$$

Solution. We have for all i,
$$E[T \,|\, N = i] = i\,E[X],$$
since conditional on N = i, you will visit exactly i stores, and you will spend an
expected amount of money E[X] in each.
We now apply the total expectation theorem. We have
$$\begin{aligned}
E[T] &= \sum_{i=1}^{\infty} P(N = i)\,E[T \,|\, N = i] \\
&= \sum_{i=1}^{\infty} P(N = i)\,i\,E[X] \\
&= E[X] \sum_{i=1}^{\infty} i\,P(N = i) \\
&= E[X]\,E[N].
\end{aligned}$$
Similarly, using also the independence of the $X_i$, which implies that $E[X_i X_j] = \big(E[X]\big)^2$
if i ≠ j, the second moment of T is calculated as
$$\begin{aligned}
E[T^2] &= \sum_{i=1}^{\infty} P(N = i)\,E[T^2 \,|\, N = i] \\
&= \sum_{i=1}^{\infty} P(N = i)\,E\big[ (X_1 + \cdots + X_N)^2 \,|\, N = i \big] \\
&= \sum_{i=1}^{\infty} P(N = i) \Big( i\,E[X^2] + i(i-1)\big(E[X]\big)^2 \Big) \\
&= E[X^2] \sum_{i=1}^{\infty} i\,P(N = i) + \big(E[X]\big)^2 \sum_{i=1}^{\infty} i(i-1)\,P(N = i) \\
&= E[X^2]\,E[N] + \big(E[X]\big)^2 \big( E[N^2] - E[N] \big) \\
&= \mathrm{var}(X)\,E[N] + \big(E[X]\big)^2 E[N^2].
\end{aligned}$$
The variance is then obtained by
$$\begin{aligned}
\mathrm{var}(T) &= E[T^2] - \big(E[T]\big)^2 \\
&= \mathrm{var}(X)\,E[N] + \big(E[X]\big)^2 E[N^2] - \big(E[X]\big)^2 \big(E[N]\big)^2 \\
&= \mathrm{var}(X)\,E[N] + \big(E[X]\big)^2 \Big( E[N^2] - \big(E[N]\big)^2 \Big),
\end{aligned}$$
so finally
$$\mathrm{var}(T) = \mathrm{var}(X)\,E[N] + \big(E[X]\big)^2\,\mathrm{var}(N).$$


Note: The formulas for E[T] and var(T) will also be obtained in Chapter 4, using a 
more abstract approach. 
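For small discrete examples, the two formulas of Problem 33 can be verified exactly by enumerating every outcome. The sketch below does this with rational arithmetic; the function names and the dictionary-based PMF representation (value mapped to probability) are illustrative assumptions, and the enumeration presumes that the amounts are independent, identically distributed, and independent of N.

```python
from fractions import Fraction
from itertools import product


def random_sum_moments(pmf_n, pmf_x):
    """Exact mean and variance of T = X_1 + ... + X_N by enumeration."""
    mean = Fraction(0)
    second_moment = Fraction(0)
    for n, prob_n in pmf_n.items():
        # Every assignment of values to X_1, ..., X_n, with its probability.
        for outcome in product(pmf_x.items(), repeat=n):
            prob = prob_n
            total = Fraction(0)
            for value, prob_x in outcome:
                prob *= prob_x
                total += value
            mean += prob * total
            second_moment += prob * total ** 2
    return mean, second_moment - mean ** 2


def mean_var(pmf):
    """Mean and variance of a PMF given as {value: probability}."""
    mean = sum(prob * value for value, prob in pmf.items())
    second_moment = sum(prob * value * value for value, prob in pmf.items())
    return mean, second_moment - mean * mean
```

Comparing `random_sum_moments` against `mean_var` applied to the two PMFs reproduces both identities exactly, with no sampling error, as long as the number of stores stays small enough to enumerate.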


SECTION 3.6. The Continuous Bayes’ Rule 


Problem 34. A defective coin minting machine produces coins whose probability of 
heads is a random variable P with PDF 


$$f_P(p) = \begin{cases} p\,e^{p}, & p \in [0,1], \\ 0, & \text{otherwise.} \end{cases}$$


A coin produced by this machine is selected and tossed repeatedly, with successive 
tosses assumed independent. 

(a) Find the probability that a coin toss results in heads. 

(b) Given that a coin toss resulted in heads, find the conditional PDF of P. 

(c) Given that the first coin toss resulted in heads, find the conditional probability 


of heads on the next toss. 


Problem 35.* Let X and Y be independent continuous random variables with PDFs 
fx and fy, respectively, and let Z = X +Y. 




(a) Show that $f_{Z|X}(z\,|\,x) = f_Y(z - x)$. Hint: Write an expression for the conditional
CDF of Z given X, and differentiate.


(b) Assume that X and Y are exponentially distributed with parameter λ. Find the
conditional PDF of X, given that Z = z.


(c) Assume that X and Y are normal random variables with mean zero and variances
$\sigma_x^2$ and $\sigma_y^2$, respectively. Find the conditional PDF of X, given that Z = z.


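Solution. (a) We have
$$\begin{aligned}
P(Z \le z \,|\, X = x) &= P(X + Y \le z \,|\, X = x) \\
&= P(x + Y \le z \,|\, X = x) \\
&= P(x + Y \le z) \\
&= P(Y \le z - x),
\end{aligned}$$
where the third equality follows from the independence of X and Y. By differentiating
both sides with respect to z, the result follows.


(b) We have, for 0 ≤ x ≤ z,
$$f_{X|Z}(x\,|\,z) = \frac{f_{Z|X}(z\,|\,x)\,f_X(x)}{f_Z(z)} = \frac{f_Y(z-x)\,f_X(x)}{f_Z(z)} = \frac{\lambda e^{-\lambda(z-x)}\,\lambda e^{-\lambda x}}{f_Z(z)} = \frac{\lambda^2 e^{-\lambda z}}{f_Z(z)}.$$
Since this is the same for all x, it follows that the conditional distribution of X is
uniform on the interval [0, z], with PDF $f_{X|Z}(x\,|\,z) = 1/z$.


(c) We have
$$f_{X|Z}(x\,|\,z) = \frac{f_Y(z-x)\,f_X(x)}{f_Z(z)} = \frac{1}{f_Z(z)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_y}\,e^{-(z-x)^2/(2\sigma_y^2)} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_x}\,e^{-x^2/(2\sigma_x^2)}.$$
We focus on the terms in the exponent. By completing the square, we find that the
negative of the exponent is of the form
$$\frac{(z-x)^2}{2\sigma_y^2} + \frac{x^2}{2\sigma_x^2} = \frac{\sigma_x^2 + \sigma_y^2}{2\sigma_x^2\sigma_y^2} \left( x - \frac{z\sigma_x^2}{\sigma_x^2 + \sigma_y^2} \right)^2 + \frac{z^2}{2\sigma_y^2} \left( 1 - \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2} \right).$$
Thus, the conditional PDF of X is of the form
$$f_{X|Z}(x\,|\,z) = c(z) \exp\left\{ -\frac{\sigma_x^2 + \sigma_y^2}{2\sigma_x^2\sigma_y^2} \left( x - \frac{z\sigma_x^2}{\sigma_x^2 + \sigma_y^2} \right)^2 \right\},$$
where c(z) does not depend on x and plays the role of a normalizing constant. We
recognize this as a normal distribution with mean
$$E[X \,|\, Z = z] = \frac{\sigma_x^2}{\sigma_x^2 + \sigma_y^2}\,z,$$
and variance
$$\mathrm{var}(X \,|\, Z = z) = \frac{\sigma_x^2\sigma_y^2}{\sigma_x^2 + \sigma_y^2}.$$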
In this chapter, we develop a number of more advanced topics. We introduce
methods that are useful in:


(a) deriving the distribution of a function of one or multiple random variables;


(b) dealing with the sum of independent random variables, including the case
where the number of random variables is itself random;


(c) quantifying the degree of dependence between two random variables.


With these goals in mind, we introduce a number of tools, including transforms
and convolutions, and we refine our understanding of the concept of conditional
expectation.

The material in this chapter is not needed for Chapters 5-7, with the
exception of the solutions of a few problems, and may be viewed as optional in
a first reading of the book. On the other hand, the concepts and methods
discussed here constitute essential background for a more advanced treatment of
probability and stochastic processes, and provide powerful tools in several
disciplines that rely on probabilistic models. Furthermore, the concepts introduced
in Sections 4.2 and 4.3 will be required in our study of inference and statistics,
in Chapters 8 and 9.


4.1 DERIVED DISTRIBUTIONS 


In this section, we consider functions Y = g(X) of a continuous random variable 
X. We discuss techniques whereby, given the PDF of X, we calculate the PDF 
of Y (also called a derived distribution). The principal method for doing so is 
the following two-step approach. 


Calculation of the PDF of a Function Y = g(X) of a Continuous 
Random Variable X 


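1. Calculate the CDF $F_Y$ of Y using the formula 

\[
F_Y(y) = \mathbf{P}\big(g(X) \le y\big) = \int_{\{x \,\mid\, g(x) \le y\}} f_X(x)\, dx.
\]

2. Differentiate to obtain the PDF of Y: 

\[
f_Y(y) = \frac{dF_Y}{dy}(y).
\]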





Example 4.1. Let X be uniform on [0, 1], and let $Y = \sqrt{X}$. We note that for 
every $y \in [0, 1]$, we have 

\[
F_Y(y) = \mathbf{P}(Y \le y) = \mathbf{P}(\sqrt{X} \le y) = \mathbf{P}(X \le y^2) = y^2.
\]

We then differentiate and obtain 

\[
f_Y(y) = \frac{dF_Y}{dy}(y) = \frac{d(y^2)}{dy} = 2y, \qquad 0 \le y \le 1.
\]

Outside the range [0, 1], the CDF $F_Y(y)$ is constant, with $F_Y(y) = 0$ for $y \le 0$, and 
$F_Y(y) = 1$ for $y \ge 1$. By differentiating, we see that $f_Y(y) = 0$ for y outside [0, 1]. 
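
As a numerical sanity check of the two-step approach, the following Python sketch approximates step 1 by a midpoint-rule integral and step 2 by a central difference, and applies both to Example 4.1. The function names, the grid size, and the step size are our own illustrative choices; this is a rough check under those assumptions, not a general-purpose routine.

```python
import math

def cdf_of_function(f_x, g, y, lo, hi, n=20_000):
    """Step 1: integrate f_X over {x in [lo, hi] : g(x) <= y} with the midpoint rule."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        if g(x) <= y:
            total += f_x(x) * dx
    return total

def pdf_of_function(f_x, g, y, lo, hi, eps=0.01):
    """Step 2: approximate dF_Y/dy at y by a central difference of the CDF."""
    upper = cdf_of_function(f_x, g, y + eps, lo, hi)
    lower = cdf_of_function(f_x, g, y - eps, lo, hi)
    return (upper - lower) / (2 * eps)

# Example 4.1: X uniform on [0, 1] and Y = sqrt(X).
uniform_pdf = lambda x: 1.0
cdf_at_half = cdf_of_function(uniform_pdf, math.sqrt, 0.5, 0.0, 1.0)
pdf_at_half = pdf_of_function(uniform_pdf, math.sqrt, 0.5, 0.0, 1.0)
print(cdf_at_half, pdf_at_half)   # compare with y**2 = 0.25 and 2*y = 1.0
```

The same two functions can be reused for other choices of g, provided the interval [lo, hi] covers the range of X.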


Example 4.2. John Slow is driving from Boston to the New York area, a distance 
of 180 miles at a constant speed, whose value is uniformly distributed between 30 
and 60 miles per hour. What is the PDF of the duration of the trip? 

Let X be the speed and let Y = g(X) be the trip duration: 


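\[
g(X) = \frac{180}{X}.
\]

To find the CDF of Y, we must calculate 

\[
\mathbf{P}(Y \le y) = \mathbf{P}\left( \frac{180}{X} \le y \right) = \mathbf{P}\left( \frac{180}{y} \le X \right).
\]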


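We use the given uniform PDF of X, which is 

\[
f_X(x) = \begin{cases} 1/30, & \text{if } 30 \le x \le 60, \\ 0, & \text{otherwise,} \end{cases}
\]

and the corresponding CDF, which is 

\[
F_X(x) = \begin{cases} 0, & \text{if } x \le 30, \\ (x - 30)/30, & \text{if } 30 \le x \le 60, \\ 1, & \text{if } 60 \le x. \end{cases}
\]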


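Thus, 

\[
\begin{aligned}
F_Y(y) &= \mathbf{P}\left( \frac{180}{y} \le X \right) = 1 - F_X\left( \frac{180}{y} \right) \\
&= \begin{cases} 0, & \text{if } y \le 180/60, \\[2pt] 1 - \dfrac{\frac{180}{y} - 30}{30}, & \text{if } 180/60 \le y \le 180/30, \\[2pt] 1, & \text{if } 180/30 \le y, \end{cases} \\
&= \begin{cases} 0, & \text{if } y \le 3, \\ 2 - (6/y), & \text{if } 3 \le y \le 6, \\ 1, & \text{if } 6 \le y, \end{cases}
\end{aligned}
\]

(see Fig. 4.1). Differentiating this expression, we obtain the PDF of Y: 

\[
f_Y(y) = \begin{cases} 0, & \text{if } y \le 3, \\ 6/y^2, & \text{if } 3 < y < 6, \\ 0, & \text{if } 6 \le y. \end{cases}
\]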
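
The CDF derived in Example 4.2 can also be compared with a simulation. The sketch below draws speeds uniformly between 30 and 60, computes the trip durations, and prints the empirical CDF next to the formula 2 - 6/y. The seed, the sample size, and the helper names are arbitrary choices made for illustration.

```python
import random

rng = random.Random(0)   # arbitrary fixed seed so that the run is repeatable
durations = [180.0 / rng.uniform(30.0, 60.0) for _ in range(200_000)]

def empirical_cdf(samples, y):
    """Fraction of the samples that are at most y."""
    return sum(s <= y for s in samples) / len(samples)

def cdf_formula(y):
    """CDF of the trip duration obtained in Example 4.2."""
    if y <= 3:
        return 0.0
    if y >= 6:
        return 1.0
    return 2.0 - 6.0 / y

for y in (3.5, 4.0, 5.0):
    print(y, round(empirical_cdf(durations, y), 3), round(cdf_formula(y), 3))
```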


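Example 4.3. Let $Y = g(X) = X^2$, where X is a random variable with known 
PDF. For any $y \ge 0$, we have 

\[
\begin{aligned}
F_Y(y) &= \mathbf{P}(Y \le y) \\
&= \mathbf{P}(X^2 \le y) \\
&= \mathbf{P}(-\sqrt{y} \le X \le \sqrt{y}) \\
&= F_X(\sqrt{y}) - F_X(-\sqrt{y}),
\end{aligned}
\]

and therefore, by differentiating and using the chain rule, 

\[
f_Y(y) = \frac{1}{2\sqrt{y}}\, f_X(\sqrt{y}) + \frac{1}{2\sqrt{y}}\, f_X(-\sqrt{y}), \qquad y \ge 0.
\]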


[Figure 4.1 panels: PDF $f_X(x)$, CDF $F_Y(y)$, PDF $f_Y(y)$.] 

Figure 4.1: The calculation of the PDF of Y = 180/X in Example 4.2. The 
arrows indicate the flow of the calculation. 


The Linear Case 


We now focus on the important special case where Y is a linear function of X; 
see Fig. 4.2 for a graphical interpretation. 


Figure 4.2: The PDF of aX + b in terms of the PDF of X. In this figure, a = 2 
and b = 5. As a first step, we obtain the PDF of aX. The range of Y is wider than 
the range of X, by a factor of a. Thus, the PDF $f_X$ must be stretched (scaled 
horizontally) by this factor. But in order to keep the total area under the PDF 
equal to 1, we need to scale down the PDF (vertically) by the same factor a. The 
random variable aX + b is the same as aX except that its values are shifted by 
b. Accordingly, we take the PDF of aX and shift it (horizontally) by b. The end 
result of these operations is the PDF of Y = aX + b and is given mathematically 
by 

\[
f_Y(y) = \frac{1}{|a|}\, f_X\left( \frac{y - b}{a} \right).
\]

If a were negative, the procedure would be the same except that the PDF 
of X would first need to be reflected around the vertical axis ("flipped") yielding 
$f_{-X}$. Then a horizontal and vertical scaling (by a factor of |a| and 1/|a|, respectively) 
yields the PDF of -|a|X = aX. Finally, a horizontal shift of b would again 
yield the PDF of aX + b. 


The PDF of a Linear Function of a Random Variable 


Let X be a continuous random variable with PDF fx, and let 
Y =aX +b, 


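where a and b are scalars, with $a \ne 0$. Then, 

\[
f_Y(y) = \frac{1}{|a|}\, f_X\left( \frac{y - b}{a} \right).
\]

To verify this formula, we first calculate the CDF of Y and then differentiate. 
We only show the steps for the case where a > 0; the case a < 0 is similar. 
We have 

\[
F_Y(y) = \mathbf{P}(Y \le y) = \mathbf{P}(aX + b \le y) = \mathbf{P}\left( X \le \frac{y - b}{a} \right) = F_X\left( \frac{y - b}{a} \right).
\]

We now differentiate this equality and use the chain rule, to obtain 

\[
f_Y(y) = \frac{dF_Y}{dy}(y) = \frac{1}{a}\, f_X\left( \frac{y - b}{a} \right).
\]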











Example 4.4. A Linear Function of an Exponential Random Variable. 
Suppose that X is an exponential random variable with PDF 

\[
f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases}
\]

where $\lambda$ is a positive parameter. Let Y = aX + b. Then, 

\[
f_Y(y) = \begin{cases} \dfrac{\lambda}{|a|}\, e^{-\lambda (y - b)/a}, & \text{if } (y - b)/a \ge 0, \\[4pt] 0, & \text{otherwise.} \end{cases}
\]

Note that if b = 0 and a > 0, then Y is an exponential random variable with 
parameter $\lambda/a$. In general, however, Y need not be exponential. For example, if 
a < 0 and b = 0, then the range of Y is the negative real axis. 
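
The linear-case formula translates directly into code. The following sketch builds $f_Y$ from $f_X$, a, and b, and checks the remark of Example 4.4 that, for b = 0 and a > 0, the result is an exponential PDF with parameter $\lambda/a$. The function names and the numerical values of $\lambda$ and a are hypothetical, chosen only for illustration.

```python
import math

def linear_transform_pdf(f_x, a, b):
    """Return the PDF of Y = aX + b, using f_Y(y) = f_X((y - b)/a) / |a|."""
    if a == 0:
        raise ValueError("a must be nonzero")
    return lambda y: f_x((y - b) / a) / abs(a)

def exponential_pdf(lam):
    """PDF of an exponential random variable with parameter lam."""
    return lambda x: lam * math.exp(-lam * x) if x >= 0 else 0.0

# b = 0 and a > 0: Y = 4X should be exponential with parameter 2/4 = 0.5.
f_y = linear_transform_pdf(exponential_pdf(2.0), a=4.0, b=0.0)
print(f_y(1.0), exponential_pdf(0.5)(1.0))

# a < 0 and b = 0: the range of Y is the negative real axis.
f_neg = linear_transform_pdf(exponential_pdf(2.0), a=-1.0, b=0.0)
print(f_neg(1.0), f_neg(-1.0))
```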


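Example 4.5. A Linear Function of a Normal Random Variable is Normal. 
Suppose that X is a normal random variable with mean $\mu$ and variance $\sigma^2$, 
and let Y = aX + b, where a and b are scalars, with $a \ne 0$. We have 

\[
f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}.
\]

Therefore, 

\[
\begin{aligned}
f_Y(y) &= \frac{1}{|a|}\, f_X\left( \frac{y - b}{a} \right) \\
&= \frac{1}{|a|} \cdot \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ - \left( \frac{y - b}{a} - \mu \right)^2 \Big/ 2\sigma^2 \right\} \\
&= \frac{1}{\sqrt{2\pi}\,|a|\,\sigma} \exp\left\{ - \frac{(y - b - a\mu)^2}{2 a^2 \sigma^2} \right\}.
\end{aligned}
\]

We recognize this as a normal PDF with mean $a\mu + b$ and variance $a^2\sigma^2$. In 
particular, Y is a normal random variable. 


The Monotonic Case 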


The calculation and the formula for the linear case can be generalized to the 
case where g is a monotonic function. Let X be a continuous random variable 
and suppose that its range is contained in a certain interval I, in the sense that 
$f_X(x) = 0$ for $x \notin I$. We consider the random variable Y = g(X), and assume 
that g is strictly monotonic over the interval I, so that either 


(a) $g(x) < g(x')$ for all $x, x' \in I$ satisfying $x < x'$ (monotonically increasing 
case), or 


(b) $g(x) > g(x')$ for all $x, x' \in I$ satisfying $x < x'$ (monotonically decreasing 
case). 


Furthermore, we assume that the function g is differentiable. Its derivative 
will necessarily be nonnegative in the increasing case and nonpositive in the 
decreasing case. 

An important fact is that a strictly monotonic function can be "inverted" 
in the sense that there is some function h, called the inverse of g, such that for 
all $x \in I$, we have 

\[
y = g(x) \quad \text{if and only if} \quad x = h(y).
\]

For example, the inverse of the function g(x) = 180/x considered in Example 4.2 
is h(y) = 180/y, because we have y = 180/x if and only if x = 180/y. Other 
such examples of pairs of inverse functions include 

\[
g(x) = ax + b, \qquad h(y) = \frac{y - b}{a},
\]

where a and b are scalars with $a \ne 0$, and 

\[
g(x) = e^{ax}, \qquad h(y) = \frac{\ln y}{a},
\]

where a is a nonzero scalar. 
For strictly monotonic functions g, the following is a convenient analytical 
formula for the PDF of the function Y = g(X). 




PDF Formula for a Strictly Monotonic Function of a Continuous 
Random Variable 


Suppose that g is strictly monotonic and that for some function h and all x 
in the range of X we have 

\[
y = g(x) \quad \text{if and only if} \quad x = h(y).
\]

Assume that h is differentiable. Then, the PDF of Y in the region where 
$f_Y(y) > 0$ is given by 

\[
f_Y(y) = f_X\big(h(y)\big) \left| \frac{dh}{dy}(y) \right|.
\]





For a verification of the above formula, assume first that g is monotonically 
increasing. Then, we have 

\[
F_Y(y) = \mathbf{P}\big(g(X) \le y\big) = \mathbf{P}\big(X \le h(y)\big) = F_X\big(h(y)\big),
\]

where the second equality can be justified using the monotonically increasing 
property of g (see Fig. 4.3). By differentiating this relation, using also the chain 
rule, we obtain 

\[
f_Y(y) = \frac{dF_Y}{dy}(y) = f_X\big(h(y)\big)\, \frac{dh}{dy}(y).
\]

Because g is monotonically increasing, h is also monotonically increasing, so its 
derivative is nonnegative: 

\[
\frac{dh}{dy}(y) = \left| \frac{dh}{dy}(y) \right|.
\]

This justifies the PDF formula for a monotonically increasing function g. The 
justification for the case of a monotonically decreasing function is similar: we 
differentiate instead the relation 

\[
F_Y(y) = \mathbf{P}\big(g(X) \le y\big) = \mathbf{P}\big(X \ge h(y)\big) = 1 - F_X\big(h(y)\big),
\]

and use the chain rule. 






[Figure 4.3 panels: graph of y = g(x) with the point h(y) marked on the x-axis; the event {X ≤ h(y)} (left) and the event {X ≥ h(y)} (right).] 

Figure 4.3: Calculating the probability $\mathbf{P}(g(X) \le y)$. When g is monotonically 
increasing (left figure), the event $\{g(X) \le y\}$ is the same as the event $\{X \le h(y)\}$. 
When g is monotonically decreasing (right figure), the event $\{g(X) \le y\}$ is the 
same as the event $\{X \ge h(y)\}$. 


Example 4.2 (continued). To check the PDF formula, let us apply it to the 
problem of Example 4.2. In the region of interest, $x \in [30, 60]$, we have h(y) = 
180/y, and 

\[
f_X\big(h(y)\big) = \frac{1}{30}, \qquad \left| \frac{dh}{dy}(y) \right| = \frac{180}{y^2}.
\]

Thus, in the region of interest $y \in [3, 6]$, the PDF formula yields 

\[
f_Y(y) = f_X\big(h(y)\big) \left| \frac{dh}{dy}(y) \right| = \frac{1}{30} \cdot \frac{180}{y^2} = \frac{6}{y^2},
\]

consistent with the expression obtained earlier. 


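Example 4.6. Let $Y = g(X) = X^2$, where X is a continuous uniform random 
variable on the interval (0, 1]. Within this interval, g is strictly monotonic, and its 
inverse is $h(y) = \sqrt{y}$. Thus, for any $y \in (0, 1]$, we have 

\[
f_X(\sqrt{y}) = 1, \qquad \left| \frac{dh}{dy}(y) \right| = \frac{1}{2\sqrt{y}},
\]

and 

\[
f_Y(y) = \begin{cases} \dfrac{1}{2\sqrt{y}}, & \text{if } y \in (0, 1], \\[4pt] 0, & \text{otherwise.} \end{cases}
\]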
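
The monotonic-case formula can be sketched as a small function that takes $f_X$ and the inverse h, with the derivative of h approximated numerically. The names and the finite-difference step below are our own assumptions; the two calls reproduce the values given by Example 4.2 (continued) and Example 4.6 at one point each.

```python
def monotonic_pdf(f_x, h, y, eps=1e-6):
    """Evaluate f_Y(y) = f_X(h(y)) * |dh/dy (y)|, using a central difference for dh/dy."""
    dh_dy = (h(y + eps) - h(y - eps)) / (2 * eps)
    return f_x(h(y)) * abs(dh_dy)

# Example 4.2: X uniform on [30, 60], g(x) = 180/x, so h(y) = 180/y.
f_speed = lambda x: 1.0 / 30.0 if 30.0 <= x <= 60.0 else 0.0
trip_pdf_at_4 = monotonic_pdf(f_speed, lambda y: 180.0 / y, 4.0)
print(trip_pdf_at_4)       # the formula 6/y**2 gives 0.375 at y = 4

# Example 4.6: X uniform on (0, 1], g(x) = x**2, so h(y) = sqrt(y).
f_unit = lambda x: 1.0 if 0.0 < x <= 1.0 else 0.0
square_pdf_at_quarter = monotonic_pdf(f_unit, lambda y: y ** 0.5, 0.25)
print(square_pdf_at_quarter)   # the formula 1/(2*sqrt(y)) gives 1.0 at y = 0.25
```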


We finally note that if we interpret PDFs in terms of probabilities of small 
intervals, the content of our formulas becomes pretty intuitive: see Fig. 4.4. 


Functions of Two Random Variables 


The two-step procedure that first calculates the CDF and then differentiates to 
obtain the PDF also applies to functions of more than one random variable. 


Example 4.7. Two archers shoot at a target. The distance of each shot from the 
center of the target is uniformly distributed from 0 to 1, independent of the other 
shot. What is the PDF of the distance of the losing shot from the center? 

Let X and Y be the distances from the center of the first and second shots, 
respectively. Let also Z be the distance of the losing shot: 

\[
Z = \max\{X, Y\}.
\]

We know that X and Y are uniformly distributed over [0, 1], so that for all $z \in [0, 1]$, 
we have 

\[
\mathbf{P}(X \le z) = \mathbf{P}(Y \le z) = z.
\]


[Figure 4.4 plot: graph of g with slope $\frac{dg}{dx}(x)$, the interval $[x, x + \delta_1]$ on the horizontal axis, and its image $[y, y + \delta_2]$ on the vertical axis.] 

Figure 4.4: Illustration of the PDF formula for a monotonically increasing function 
g. Consider an interval $[x, x + \delta_1]$, where $\delta_1$ is a small number. Under 
the mapping g, the image of this interval is another interval $[y, y + \delta_2]$. Since 
(dg/dx)(x) is the slope of g, we have 

\[
\frac{\delta_2}{\delta_1} \approx \frac{dg}{dx}(x),
\]

or in terms of the inverse function, 

\[
\frac{\delta_1}{\delta_2} \approx \frac{dh}{dy}(y).
\]

We now note that the event $\{x \le X \le x + \delta_1\}$ is the same as the event $\{y \le Y \le y + \delta_2\}$. Thus, 

\[
\begin{aligned}
f_Y(y)\,\delta_2 &\approx \mathbf{P}(y \le Y \le y + \delta_2) \\
&= \mathbf{P}(x \le X \le x + \delta_1) \\
&\approx f_X(x)\,\delta_1.
\end{aligned}
\]

We move $\delta_1$ to the left-hand side and use our earlier formula for the ratio $\delta_2/\delta_1$, 
to obtain 

\[
f_Y(y)\, \frac{dg}{dx}(x) = f_X(x).
\]

Alternatively, if we move $\delta_2$ to the right-hand side and use the formula for $\delta_1/\delta_2$, 
we obtain 

\[
f_Y(y) = f_X\big(h(y)\big)\, \frac{dh}{dy}(y).
\]
Thus, using the independence of X and Y, we have for all $z \in [0, 1]$, 

\[
\begin{aligned}
F_Z(z) &= \mathbf{P}\big(\max\{X, Y\} \le z\big) \\
&= \mathbf{P}(X \le z,\, Y \le z) \\
&= \mathbf{P}(X \le z)\, \mathbf{P}(Y \le z) \\
&= z^2.
\end{aligned}
\]


Figure 4.5: The calculation of the CDF of Z = Y/X in Example 4.8. The value 
$\mathbf{P}(Y/X \le z)$ is equal to the shaded subarea of the unit square. The figure on the 
left deals with the case where $0 \le z \le 1$ and the figure on the right refers to the 
case where z > 1. 


Differentiating, we obtain 

\[
f_Z(z) = \begin{cases} 2z, & \text{if } 0 \le z \le 1, \\ 0, & \text{otherwise.} \end{cases}
\]


Example 4.8. Let X and Y be independent random variables that are uniformly 
distributed on the interval [0,1]. What is the PDF of the random variable Z = 
Y/X? 

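We will find the PDF of Z by first finding its CDF and then differentiating. 
We consider separately the cases $0 \le z \le 1$ and $z > 1$. As shown in Fig. 4.5, we 
have 

\[
F_Z(z) = \mathbf{P}\left( \frac{Y}{X} \le z \right) = \begin{cases} z/2, & \text{if } 0 \le z \le 1, \\ 1 - 1/(2z), & \text{if } z > 1, \\ 0, & \text{otherwise.} \end{cases}
\]

By differentiating, we obtain 

\[
f_Z(z) = \begin{cases} 1/2, & \text{if } 0 \le z \le 1, \\ 1/(2z^2), & \text{if } z > 1, \\ 0, & \text{otherwise.} \end{cases}
\]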
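
For two independent uniform random variables on [0, 1], the CDF of Z = g(X, Y) is the area of the part of the unit square where g(x, y) ≤ z. The sketch below approximates that area on a midpoint grid and compares it with the CDFs of Examples 4.7 and 4.8. The grid resolution and the function name are illustrative assumptions.

```python
def cdf_on_unit_square(g, z, n=400):
    """Approximate P(g(X, Y) <= z) for independent X, Y uniform on [0, 1]
    by counting the cells of an n-by-n midpoint grid where g(x, y) <= z."""
    hits = 0
    for i in range(n):
        x = (i + 0.5) / n
        for j in range(n):
            y = (j + 0.5) / n
            if g(x, y) <= z:
                hits += 1
    return hits / (n * n)

# Example 4.7: Z = max(X, Y), with F_Z(z) = z**2 on [0, 1].
max_cdf = cdf_on_unit_square(max, 0.5)

# Example 4.8: Z = Y/X, with F_Z(z) = z/2 for 0 <= z <= 1 and 1 - 1/(2z) for z > 1.
ratio = lambda x, y: y / x
ratio_cdf_low = cdf_on_unit_square(ratio, 0.5)
ratio_cdf_high = cdf_on_unit_square(ratio, 2.0)

print(max_cdf, ratio_cdf_low, ratio_cdf_high)   # compare with 0.25, 0.25, 0.75
```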


Example 4.9. Romeo and Juliet have a date at a given time, and each, independently, 
will be late by an amount of time that is exponentially distributed with 
parameter $\lambda$. What is the PDF of the difference between their times of arrival? 
Let us denote by X and Y the amounts by which Romeo and Juliet are late, 
respectively. We want to find the PDF of Z = X - Y, assuming that X and Y are 
independent and exponentially distributed with parameter $\lambda$. We will first calculate 
the CDF $F_Z(z)$ by considering separately the cases $z \ge 0$ and $z < 0$ (see Fig. 4.6). 


[Figure 4.6 panels: the line x - y = z in the (x, y) plane, for $z \ge 0$ (left) and $z < 0$ (right).] 

Figure 4.6: The calculation of the CDF of Z = X - Y in Example 4.9. To 
obtain the value $\mathbf{P}(X - Y > z)$ we must integrate the joint PDF $f_{X,Y}(x, y)$ 
over the shaded area in the above figures, which correspond to $z \ge 0$ (left 
side) and $z < 0$ (right side). 


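For $z \ge 0$, we have (see the left side of Fig. 4.6) 

\[
\begin{aligned}
F_Z(z) &= \mathbf{P}(X - Y \le z) \\
&= 1 - \mathbf{P}(X - Y > z) \\
&= 1 - \int_0^\infty \left( \int_{z+y}^\infty f_{X,Y}(x, y)\, dx \right) dy \\
&= 1 - \int_0^\infty \lambda e^{-\lambda y} \left( \int_{z+y}^\infty \lambda e^{-\lambda x}\, dx \right) dy \\
&= 1 - \int_0^\infty \lambda e^{-\lambda y}\, e^{-\lambda (z + y)}\, dy \\
&= 1 - e^{-\lambda z} \int_0^\infty \lambda e^{-2\lambda y}\, dy \\
&= 1 - \frac{1}{2}\, e^{-\lambda z}.
\end{aligned}
\]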
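For the case $z < 0$, we can use a similar calculation, but we can also argue 
using symmetry. Indeed, the symmetry of the situation implies that the random 
variables Z = X - Y and -Z = Y - X have the same distribution. We have 

\[
F_Z(z) = \mathbf{P}(Z \le z) = \mathbf{P}(-Z \ge -z) = \mathbf{P}(Z \ge -z) = 1 - F_Z(-z).
\]

With $z < 0$, we have $-z > 0$ and using the formula derived earlier, 

\[
F_Z(z) = 1 - F_Z(-z) = 1 - \left( 1 - \frac{1}{2}\, e^{-\lambda(-z)} \right) = \frac{1}{2}\, e^{\lambda z}.
\]

Combining the two cases $z \ge 0$ and $z < 0$, we obtain 

\[
F_Z(z) = \begin{cases} 1 - \dfrac{1}{2}\, e^{-\lambda z}, & \text{if } z \ge 0, \\[4pt] \dfrac{1}{2}\, e^{\lambda z}, & \text{if } z < 0. \end{cases}
\]

We now calculate the PDF of Z by differentiating its CDF. We have 

\[
f_Z(z) = \begin{cases} \dfrac{\lambda}{2}\, e^{-\lambda z}, & \text{if } z \ge 0, \\[4pt] \dfrac{\lambda}{2}\, e^{\lambda z}, & \text{if } z < 0, \end{cases}
\]

or 

\[
f_Z(z) = \frac{\lambda}{2}\, e^{-\lambda |z|}.
\]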


This is known as a two-sided exponential PDF, also called the Laplace PDF. 
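
The CDF of Example 4.9 can be compared with a simulation of the two arrival delays. In the sketch below, the value of the parameter, the seed, and the sample size are arbitrary illustrative choices, and the helper names are ours.

```python
import math
import random

lam = 1.5                  # illustrative rate; any positive value could be used
rng = random.Random(1)     # arbitrary fixed seed
diffs = [rng.expovariate(lam) - rng.expovariate(lam) for _ in range(200_000)]

def laplace_cdf(z, lam):
    """CDF of Z = X - Y from Example 4.9 (two-sided exponential)."""
    if z >= 0:
        return 1.0 - 0.5 * math.exp(-lam * z)
    return 0.5 * math.exp(lam * z)

def empirical_cdf(samples, z):
    """Fraction of the samples that are at most z."""
    return sum(s <= z for s in samples) / len(samples)

for z in (-1.0, 0.0, 0.5):
    print(z, round(empirical_cdf(diffs, z), 3), round(laplace_cdf(z, lam), 3))
```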


Sums of Independent Random Variables — Convolution 


We now consider an important example of a function Z of two random variables, 
namely, the case where Z = X + Y, for independent X and Y. For some initial 
insight, we start by deriving a PMF formula for the case where X and Y are 
discrete. 

Let Z = X + Y, where X and Y are independent integer-valued random 
variables with PMFs $p_X$ and $p_Y$, respectively. Then, for any integer z, 

\[
\begin{aligned}
p_Z(z) &= \mathbf{P}(X + Y = z) \\
&= \sum_{\{(x, y) \,\mid\, x + y = z\}} \mathbf{P}(X = x,\, Y = y) \\
&= \sum_x \mathbf{P}(X = x,\, Y = z - x) \\
&= \sum_x p_X(x)\, p_Y(z - x).
\end{aligned}
\]


The resulting PMF pz is called the convolution of the PMFs of X and Y. See 
Fig. 4.7 for an illustration. 
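
The discrete convolution can be computed exactly with rational arithmetic. The sketch below represents each PMF as a dictionary and accumulates $p_X(x)\,p_Y(y)$ over all pairs with x + y = z, which is the same sum as in the formula above. The dictionary representation and the fair-die example are our own illustrative choices.

```python
from fractions import Fraction

def convolve_pmfs(p_x, p_y):
    """PMF of Z = X + Y for independent integer-valued X and Y,
    each given as a dictionary {value: probability}."""
    p_z = {}
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] = p_z.get(x + y, 0) + px * py
    return p_z

die = {k: Fraction(1, 6) for k in range(1, 7)}   # a fair six-sided die
two_dice = convolve_pmfs(die, die)
print(two_dice[7])   # 6 of the 36 equally likely pairs sum to 7
```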

Suppose now that X and Y are independent continuous random variables 
with PDFs fx and fy, respectively. We wish to find the PDF of Z = X +Y. 
Towards this goal, we will first find the joint PDF of X and Z, and then integrate 
to find the PDF of Z. 

We first note that 

\[
\begin{aligned}
\mathbf{P}(Z \le z \mid X = x) &= \mathbf{P}(X + Y \le z \mid X = x) \\
&= \mathbf{P}(x + Y \le z \mid X = x) \\
&= \mathbf{P}(x + Y \le z) \\
&= \mathbf{P}(Y \le z - x),
\end{aligned}
\]

where the third equality follows from the independence of X and Y. By differentiating 
both sides with respect to z, we see that $f_{Z|X}(z \mid x) = f_Y(z - x)$. Using 
the multiplication rule, we have 

\[
f_{X,Z}(x, z) = f_X(x)\, f_{Z|X}(z \mid x) = f_X(x)\, f_Y(z - x),
\]

from which we finally obtain 

\[
f_Z(z) = \int_{-\infty}^{\infty} f_{X,Z}(x, z)\, dx = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx.
\]

This formula is entirely analogous to the one for the discrete case, except that 
the summation is replaced by an integral and the PMFs are replaced by PDFs. 
For an intuitive interpretation, see Fig. 4.8. 


Figure 4.7: The probability $p_Z(3)$ that X + Y = 3 is the sum of the probabilities 
of all pairs (x, y) such that x + y = 3, which are the points indicated in the figure. 
The probability of a generic such point is of the form 

\[
p_{X,Y}(x, 3 - x) = p_X(x)\, p_Y(3 - x).
\]


Example 4.10. The random variables X and Y are independent and uniformly 
distributed in the interval [0, 1]. The PDF of Z = X + Y is 

\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx.
\]

The integrand $f_X(x) f_Y(z - x)$ is nonzero (and equal to 1) for $0 \le x \le 1$ and 
$0 \le z - x \le 1$. Combining these two inequalities, the integrand is nonzero for 
$\max\{0, z - 1\} \le x \le \min\{1, z\}$. Thus, 

\[
f_Z(z) = \begin{cases} \min\{1, z\} - \max\{0, z - 1\}, & 0 \le z \le 2, \\ 0, & \text{otherwise,} \end{cases}
\]

which has the triangular shape shown in Fig. 4.9. 
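
The triangular PDF of Example 4.10 can be reproduced by evaluating the convolution integral numerically. The sketch below uses a midpoint rule over [0, 1], the range of X; the function names and the number of grid points are illustrative assumptions.

```python
def convolve_at(f_x, f_y, z, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f_X(x) * f_Y(z - x) over [lo, hi]."""
    dx = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * dx
        total += f_x(x) * f_y(z - x) * dx
    return total

def unit_uniform(t):
    """PDF of a uniform random variable on [0, 1]."""
    return 1.0 if 0.0 <= t <= 1.0 else 0.0

def triangle(z):
    """Closed-form PDF of Z = X + Y from Example 4.10."""
    return min(1.0, z) - max(0.0, z - 1.0) if 0.0 <= z <= 2.0 else 0.0

for z in (0.5, 1.0, 1.5):
    print(z, round(convolve_at(unit_uniform, unit_uniform, z, 0.0, 1.0), 4), triangle(z))
```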


We next describe an important application of the convolution formula. 


Example 4.11. The Sum of Two Independent Normal Random Variables 
is Normal. Let X and Y be independent normal random variables with means 
$\mu_x$, $\mu_y$, and variances $\sigma_x^2$, $\sigma_y^2$, respectively, and let Z = X + Y. We have 

\[
f_Z(z) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_x} \exp\left\{ - \frac{(x - \mu_x)^2}{2\sigma_x^2} \right\} \frac{1}{\sqrt{2\pi}\,\sigma_y} \exp\left\{ - \frac{(z - x - \mu_y)^2}{2\sigma_y^2} \right\} dx.
\]

This integral can be evaluated in closed form, but the details are tedious and are 
omitted. The answer turns out to be of the form 

\[
f_Z(z) = \frac{1}{\sqrt{2\pi(\sigma_x^2 + \sigma_y^2)}} \exp\left\{ - \frac{(z - \mu_x - \mu_y)^2}{2(\sigma_x^2 + \sigma_y^2)} \right\},
\]

which we recognize as a normal PDF with mean $\mu_x + \mu_y$ and variance $\sigma_x^2 + \sigma_y^2$. 
We therefore reach the conclusion that the sum of two independent normal random 
variables is normal. Given that scalar multiples of normal random variables are 
also normal (cf. Example 4.5), it follows that aX + bY is also normal, for any 
nonzero a and b. An alternative derivation of this important fact will be provided 
in Section 4.4, using transforms. 


Figure 4.8: Illustration of the convolution formula for the case of continuous 
random variables (compare with Fig. 4.7). For small $\delta$, the probability of the 
strip indicated in the figure is $\mathbf{P}(z \le X + Y \le z + \delta) \approx f_Z(z)\,\delta$. Thus, 

\[
\begin{aligned}
f_Z(z)\,\delta &\approx \mathbf{P}(z \le X + Y \le z + \delta) \\
&= \int_{-\infty}^{\infty} \int_{z - x}^{z - x + \delta} f_X(x)\, f_Y(y)\, dy\, dx \\
&\approx \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\,\delta\, dx.
\end{aligned}
\]

The desired formula follows by canceling $\delta$ from both sides. 


Figure 4.9: The PDF of the sum 
of two independent uniform random 
variables in [0, 1]. 
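
Since the closed-form evaluation is omitted in Example 4.11, a numerical check may be reassuring. The sketch below evaluates the convolution integral of two normal PDFs by a midpoint rule and compares it with the normal PDF whose mean and variance are the sums. The parameter values, the truncation width, and the function names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mean, var):
    """PDF of a normal random variable with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def convolved_normal_pdf(z, mean_x, var_x, mean_y, var_y, n=4000, width=10.0):
    """Midpoint-rule value of the convolution integral at z, with the integration
    range truncated to mean_x plus or minus `width` standard deviations of X."""
    half = width * math.sqrt(var_x)
    dx = 2 * half / n
    total = 0.0
    for i in range(n):
        x = mean_x - half + (i + 0.5) * dx
        total += normal_pdf(x, mean_x, var_x) * normal_pdf(z - x, mean_y, var_y) * dx
    return total

# Illustrative parameters (not taken from the text).
numeric = convolved_normal_pdf(2.0, 1.0, 4.0, -0.5, 9.0)
closed_form = normal_pdf(2.0, 1.0 + (-0.5), 4.0 + 9.0)
print(numeric, closed_form)
```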


Example 4.12. The Difference of Two Independent Random Variables. 
The convolution formula can also be used to find the PDF of X - Y, when X and 
Y are independent, by viewing X - Y as the sum of X and -Y. We observe that 
the PDF of -Y is given by $f_{-Y}(y) = f_Y(-y)$, and obtain 

\[
f_{X-Y}(z) = \int_{-\infty}^{\infty} f_X(x)\, f_{-Y}(z - x)\, dx = \int_{-\infty}^{\infty} f_X(x)\, f_Y(x - z)\, dx.
\]

As an illustration, consider the case where X and Y are independent exponential 
random variables with parameter $\lambda$, as in Example 4.9. Fix some $z \ge 0$ and 
note that $f_Y(x - z)$ is nonzero only when $x \ge z$. Thus, 

\[
\begin{aligned}
f_{X-Y}(z) &= \int_{-\infty}^{\infty} f_X(x)\, f_Y(x - z)\, dx \\
&= \int_z^{\infty} \lambda e^{-\lambda x}\, \lambda e^{-\lambda (x - z)}\, dx \\
&= \lambda^2 e^{\lambda z} \int_z^{\infty} e^{-2\lambda x}\, dx \\
&= \lambda^2 e^{\lambda z}\, \frac{1}{2\lambda}\, e^{-2\lambda z} \\
&= \frac{\lambda}{2}\, e^{-\lambda z},
\end{aligned}
\]

in agreement with the result obtained in Example 4.9. The answer for the case 
$z < 0$ is obtained with a similar calculation or, alternatively, by noting that 

\[
f_{X-Y}(z) = f_{Y-X}(z) = f_{-(X-Y)}(z) = f_{X-Y}(-z),
\]

where the first equality holds by symmetry, since X and Y are identically distributed. 


When applying the convolution formula, often the most delicate step is 
to determine the correct limits for the integration. This is often tedious and 
error prone, but can be bypassed using a graphical method described next. 


Graphical Calculation of Convolutions 


We use a dummy variable t as the argument of the different functions involved 
in this discussion; see also Fig. 4.10. Consider two PDFs $f_X(t)$ and $f_Y(t)$. For 
a fixed value of z, the graphical evaluation of the convolution 

\[
f_Z(z) = \int_{-\infty}^{\infty} f_X(t)\, f_Y(z - t)\, dt
\]

consists of the following steps: 


(a) We plot $f_Y(z - t)$ as a function of t. This plot has the same shape as the 
plot of $f_Y(t)$ except that it is first "flipped" and then shifted by an amount 
z. If z > 0, this is a shift to the right; if z < 0, this is a shift to the left. 


(b) We place the plots of $f_X(t)$ and $f_Y(z - t)$ on top of each other, and form 
their product. 


(c) We calculate the value of $f_Z(z)$ by calculating the integral of the product 
of these two plots. 


By varying the amount z by which we are shifting, we obtain $f_Z(z)$ for any z. 
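
Steps (a) to (c) can be mirrored on a grid of values of t. The sketch below does this for two exponential PDFs with a common parameter, a case in which the convolution is known to equal $\lambda^2 z e^{-\lambda z}$ for $z \ge 0$ (this closed form is a standard fact that we use only as a reference value). The grid limits, the grid size, and the names are illustrative assumptions.

```python
import math

def graphical_convolution(f_x, f_y, z, t_min, t_max, n=20_000):
    """Evaluate f_Z(z) by flipping and shifting f_Y, multiplying by f_X, and integrating."""
    dt = (t_max - t_min) / n
    grid = [t_min + (k + 0.5) * dt for k in range(n)]
    flipped_shifted = [f_y(z - t) for t in grid]                     # step (a)
    product = [f_x(t) * v for t, v in zip(grid, flipped_shifted)]    # step (b)
    return sum(product) * dt                                         # step (c)

lam = 2.0   # illustrative parameter
expo = lambda t: lam * math.exp(-lam * t) if t >= 0 else 0.0

z = 0.7
numeric = graphical_convolution(expo, expo, z, -1.0, 3.0)
reference = lam ** 2 * z * math.exp(-lam * z)
print(numeric, reference)
```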





Figure 4.10: Illustration of the convolution calculation. For the value of z under 
consideration, $f_Z(z)$ is equal to the integral of the function shown in the last plot. 


4.2 COVARIANCE AND CORRELATION 


In this section, we introduce a quantitative measure of the strength and direction 
of the relationship between two random variables. It plays an important role in 
many contexts, and it will be used in the estimation methodology to be developed 
in Chapters 8 and 9. 

The covariance of two random variables X and Y, denoted by cov(X, Y), 
is defined by 

\[
\mathrm{cov}(X, Y) = \mathbf{E}\big[ (X - \mathbf{E}[X]) (Y - \mathbf{E}[Y]) \big].
\]


When cov(X, Y) = 0, we say that X and Y are uncorrelated. Roughly speaking, 
a positive or negative covariance indicates that the values of X - E[X] and 
Y - E[Y] obtained in a single experiment "tend" to have the same or the opposite 
sign, respectively (see Fig. 4.11). Thus the sign of the covariance provides 
an important qualitative indicator of the relationship between X and Y. 





[Figure 4.11 panels: (a) and (b), each showing an ellipse in the (x, y) plane.] 


Figure 4.11: Examples of positively and negatively correlated random variables. 
Here, X and Y are uniformly distributed over the ellipses shown in the figure. In 
case (a) the covariance cov(X, Y) is positive, while in case (b) it is negative. 


An alternative formula for the covariance is 
cov(X, Y) = E[XY] - E[X] E[Y], 


as can be verified by a simple calculation. We record a few properties of covariances 
that are easily derived from the definition: for any random variables X, 
Y, and Z, and any scalars a and b, we have 

\[
\begin{aligned}
\mathrm{cov}(X, X) &= \mathrm{var}(X), \\
\mathrm{cov}(X, aY + b) &= a \cdot \mathrm{cov}(X, Y), \\
\mathrm{cov}(X, Y + Z) &= \mathrm{cov}(X, Y) + \mathrm{cov}(X, Z).
\end{aligned}
\]

Note that if X and Y are independent, we have E[XY] = E[X] E[Y], which 
implies that cov(X, Y) = 0. Thus, if X and Y are independent, they are also 
uncorrelated. However, the converse is generally not true, as illustrated by the 
following example. 


Example 4.13. The pair of random variables (X, Y) takes the values (1, 0), (0, 1), 
(-1, 0), and (0, -1), each with probability 1/4 (see Fig. 4.12). Thus, the marginal 
PMFs of X and Y are symmetric around 0, and E[X] = E[Y] = 0. Furthermore, 
for all possible value pairs (x, y), either x or y is equal to 0, which implies that 
XY = 0 and E[XY] = 0. Therefore, 

\[
\mathrm{cov}(X, Y) = \mathbf{E}[XY] - \mathbf{E}[X]\, \mathbf{E}[Y] = 0,
\]

and X and Y are uncorrelated. However, X and Y are not independent since, for 
example, a nonzero value of X fixes the value of Y to zero. 
This example can be generalized. In particular, assume that X and Y satisfy 

\[
\mathbf{E}[X \mid Y = y] = \mathbf{E}[X], \qquad \text{for all } y.
\]

Then, assuming X and Y are discrete, the total expectation theorem implies that 

\[
\mathbf{E}[XY] = \sum_y y\, p_Y(y)\, \mathbf{E}[X \mid Y = y] = \mathbf{E}[X] \sum_y y\, p_Y(y) = \mathbf{E}[X]\, \mathbf{E}[Y],
\]

so X and Y are uncorrelated. The argument for the continuous case is similar. 
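
The claims of Example 4.13 can be verified exactly from the joint PMF. The sketch below computes the covariance with rational arithmetic and tests independence by comparing the joint PMF with the product of the marginals at every pair of values. The helper names are ours.

```python
from fractions import Fraction

# Joint PMF of Example 4.13: four points, each with probability 1/4.
joint = {(1, 0): Fraction(1, 4), (0, 1): Fraction(1, 4),
         (-1, 0): Fraction(1, 4), (0, -1): Fraction(1, 4)}

def expectation(joint_pmf, func):
    """E[func(X, Y)] under the given joint PMF."""
    return sum(p * func(x, y) for (x, y), p in joint_pmf.items())

def covariance(joint_pmf):
    """cov(X, Y) = E[XY] - E[X] E[Y]."""
    e_x = expectation(joint_pmf, lambda x, y: x)
    e_y = expectation(joint_pmf, lambda x, y: y)
    return expectation(joint_pmf, lambda x, y: x * y) - e_x * e_y

def is_independent(joint_pmf):
    """True if p_{X,Y}(x, y) = p_X(x) p_Y(y) for every pair of values."""
    xs = {x for x, _ in joint_pmf}
    ys = {y for _, y in joint_pmf}
    p_x = {x: sum(p for (a, _), p in joint_pmf.items() if a == x) for x in xs}
    p_y = {y: sum(p for (_, b), p in joint_pmf.items() if b == y) for y in ys}
    return all(joint_pmf.get((x, y), 0) == p_x[x] * p_y[y] for x in xs for y in ys)

print(covariance(joint), is_independent(joint))
```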


Figure 4.12: Joint PMF of X and Y 
for Example 4.13. Each of the four 
points shown has probability 1/4. Here 
X and Y are uncorrelated but not in- 
dependent. 





The correlation coefficient $\rho(X, Y)$ of two random variables X and Y 
that have nonzero variances is defined as 

\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\, \mathrm{var}(Y)}}.
\]

(The simpler notation $\rho$ will also be used when X and Y are clear from the 
context.) It may be viewed as a normalized version of the covariance cov(X, Y), 
and in fact, it can be shown that $\rho$ ranges from -1 to 1 (see the end-of-chapter 
problems). 

If $\rho > 0$ (or $\rho < 0$), then the values of X - E[X] and Y - E[Y] "tend" 
to have the same (or opposite, respectively) sign. The size of $|\rho|$ provides a 
normalized measure of the extent to which this is true. In fact, always assuming 
that X and Y have positive variances, it can be shown that $\rho = 1$ (or $\rho = -1$) 
if and only if there exists a positive (or negative, respectively) constant c such 
that 

\[
Y - \mathbf{E}[Y] = c\,(X - \mathbf{E}[X])
\]

(see the end-of-chapter problems). The following example illustrates in part this 
property. 
Example 4.14. Consider n independent tosses of a coin with probability of a head 
equal to p. Let X and Y be the numbers of heads and of tails, respectively, and let 
us look at the correlation coefficient of X and Y. Here, we have X + Y = n, and 
also E[X]+ E[Y] = n. Thus, 
X - E[X] = - (Y - E[Y]). 
We will calculate the correlation coefficient of X and Y, and verify that it is indeed 


equal to —1. 
We have 


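\[
\begin{aligned}
\mathrm{cov}(X, Y) &= \mathbf{E}\big[ (X - \mathbf{E}[X]) (Y - \mathbf{E}[Y]) \big] \\
&= -\mathbf{E}\big[ (X - \mathbf{E}[X])^2 \big] \\
&= -\mathrm{var}(X).
\end{aligned}
\]

Hence, the correlation coefficient is 

\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\, \mathrm{var}(Y)}} = \frac{-\mathrm{var}(X)}{\sqrt{\mathrm{var}(X)\, \mathrm{var}(X)}} = -1.
\]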
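
The conclusion of Example 4.14 can be confirmed numerically from the binomial PMF of the number of heads. In the sketch below, the values of n and p and the function name are illustrative assumptions; the result should be -1 up to floating-point rounding.

```python
import math

def heads_tails_correlation(n, p):
    """Correlation coefficient of X (number of heads) and Y = n - X (number of tails)
    in n independent tosses, computed from the binomial PMF of X."""
    pmf = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    e_x = sum(k * q for k, q in enumerate(pmf))
    e_y = sum((n - k) * q for k, q in enumerate(pmf))
    cov = sum((k - e_x) * ((n - k) - e_y) * q for k, q in enumerate(pmf))
    var_x = sum((k - e_x) ** 2 * q for k, q in enumerate(pmf))
    var_y = sum(((n - k) - e_y) ** 2 * q for k, q in enumerate(pmf))
    return cov / math.sqrt(var_x * var_y)

print(heads_tails_correlation(10, 0.3))
```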


Variance of the Sum of Random Variables 
The covariance can be used to obtain a formula for the variance of the sum 
of several (not necessarily independent) random variables. In particular, if 
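$X_1, X_2, \ldots, X_n$ are random variables with finite variance, we have 

\[
\mathrm{var}(X_1 + X_2) = \mathrm{var}(X_1) + \mathrm{var}(X_2) + 2\,\mathrm{cov}(X_1, X_2),
\]

and, more generally, 

\[
\mathrm{var}\left( \sum_{i=1}^n X_i \right) = \sum_{i=1}^n \mathrm{var}(X_i) + \sum_{\{(i,j) \,\mid\, i \ne j\}} \mathrm{cov}(X_i, X_j).
\]

This can be seen from the following calculation, where for brevity, we denote 
$\tilde{X}_i = X_i - \mathbf{E}[X_i]$: 

\[
\begin{aligned}
\mathrm{var}\left( \sum_{i=1}^n X_i \right) &= \mathbf{E}\left[ \left( \sum_{i=1}^n \tilde{X}_i \right)^2 \right] \\
&= \mathbf{E}\left[ \sum_{i=1}^n \sum_{j=1}^n \tilde{X}_i \tilde{X}_j \right] \\
&= \sum_{i=1}^n \sum_{j=1}^n \mathbf{E}[\tilde{X}_i \tilde{X}_j] \\
&= \sum_{i=1}^n \mathbf{E}[\tilde{X}_i^2] + \sum_{\{(i,j) \,\mid\, i \ne j\}} \mathbf{E}[\tilde{X}_i \tilde{X}_j] \\
&= \sum_{i=1}^n \mathrm{var}(X_i) + \sum_{\{(i,j) \,\mid\, i \ne j\}} \mathrm{cov}(X_i, X_j).
\end{aligned}
\]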

The following example illustrates the use of this formula. 


Example 4.15. Consider the hat problem discussed in Section 2.5, where n people 
throw their hats in a box and then pick a hat at random. Let us find the variance 
of X, the number of people who pick their own hat. We have 


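\[
X = X_1 + \cdots + X_n,
\]

where $X_i$ is the random variable that takes the value 1 if the ith person selects 
his/her own hat, and takes the value 0 otherwise. Noting that $X_i$ is Bernoulli with 
parameter $p = \mathbf{P}(X_i = 1) = 1/n$, we obtain 

\[
\mathbf{E}[X_i] = \frac{1}{n}, \qquad \mathrm{var}(X_i) = \frac{1}{n} \left( 1 - \frac{1}{n} \right).
\]

For $i \ne j$, we have 

\[
\begin{aligned}
\mathrm{cov}(X_i, X_j) &= \mathbf{E}[X_i X_j] - \mathbf{E}[X_i]\, \mathbf{E}[X_j] \\
&= \mathbf{P}(X_i = 1 \text{ and } X_j = 1) - \frac{1}{n} \cdot \frac{1}{n} \\
&= \mathbf{P}(X_i = 1)\, \mathbf{P}(X_j = 1 \mid X_i = 1) - \frac{1}{n^2} \\
&= \frac{1}{n} \cdot \frac{1}{n - 1} - \frac{1}{n^2} \\
&= \frac{1}{n^2 (n - 1)}.
\end{aligned}
\]

Therefore, 

\[
\begin{aligned}
\mathrm{var}(X) &= \mathrm{var}\left( \sum_{i=1}^n X_i \right) \\
&= \sum_{i=1}^n \mathrm{var}(X_i) + \sum_{\{(i,j) \,\mid\, i \ne j\}} \mathrm{cov}(X_i, X_j) \\
&= n \cdot \frac{1}{n} \left( 1 - \frac{1}{n} \right) + n(n - 1) \cdot \frac{1}{n^2 (n - 1)} \\
&= 1.
\end{aligned}
\]


Covariance and Correlation 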


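• The covariance of X and Y is given by 

\[
\mathrm{cov}(X, Y) = \mathbf{E}\big[ (X - \mathbf{E}[X]) (Y - \mathbf{E}[Y]) \big] = \mathbf{E}[XY] - \mathbf{E}[X]\, \mathbf{E}[Y].
\]

• If cov(X, Y) = 0, we say that X and Y are uncorrelated. 


• If X and Y are independent, they are uncorrelated. The converse is 
not always true. 


• We have 

\[
\mathrm{var}(X + Y) = \mathrm{var}(X) + \mathrm{var}(Y) + 2\,\mathrm{cov}(X, Y).
\]

• The correlation coefficient $\rho(X, Y)$ of two random variables X and 
Y with positive variances is defined by 

\[
\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\, \mathrm{var}(Y)}},
\]

and satisfies 

\[
-1 \le \rho(X, Y) \le 1.
\]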
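
The variance-of-a-sum formula can be checked against the hat problem of Example 4.15 by brute force. For $n \ge 2$, substituting $\mathrm{var}(X_i) = (1/n)(1 - 1/n)$ and $\mathrm{cov}(X_i, X_j) = 1/(n^2(n-1))$ into the formula gives var(X) = 1. The sketch below enumerates all equally likely hat assignments for small n and computes the mean and variance of the number of matches exactly. The function name and the choice of n are ours.

```python
from fractions import Fraction
from itertools import permutations

def hat_match_moments(n):
    """Exact mean and variance of the number of people who get their own hat,
    when all n! assignments of hats to people are equally likely."""
    counts = [sum(1 for person, hat in enumerate(perm) if person == hat)
              for perm in permutations(range(n))]
    total = len(counts)
    mean = Fraction(sum(counts), total)
    variance = Fraction(sum(c * c for c in counts), total) - mean ** 2
    return mean, variance

for n in (2, 3, 6):
    print(n, hat_match_moments(n))
```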





4.3 CONDITIONAL EXPECTATION AND VARIANCE REVISITED 


In this section, we revisit the conditional expectation of a random variable X 
given another random variable Y, and view it as a random variable determined 
by Y. We derive a reformulation of the total expectation theorem, called the law 
of iterated expectations. We also obtain a new formula, the law of total 
variance, that relates conditional and unconditional variances. 

We introduce a random variable, denoted by E[X | Y], that takes the value 
E[X | Y = y] when Y takes the value y. Since E[X | Y = y] is a function of y, 
E[X | Y] is a function of Y, and its distribution is determined by the distribution 
of Y. The properties of E[X | Y] will be important in this section but also later, 
particularly in the context of estimation and statistical inference, in Chapters 8 
and 9. 


Example 4.16. We are given a biased coin and we are told that because of 
manufacturing defects, the probability of heads, denoted by Y, is itself random, 
with a known distribution over the interval [0, 1]. We toss the coin a fixed number 
n of times, and we let X be the number of heads obtained. Then, for any $y \in [0, 1]$, 
we have E[X | Y = y] = ny, so E[X | Y] is the random variable nY. 


Since E[X | Y] is a random variable, it has an expectation E[E[X | Y]] of 
its own, which can be calculated using the expected value rule: 

\[
\mathbf{E}\big[ \mathbf{E}[X \mid Y] \big] = \begin{cases} \displaystyle \sum_y \mathbf{E}[X \mid Y = y]\, p_Y(y), & Y \text{ discrete,} \\[8pt] \displaystyle \int_{-\infty}^{\infty} \mathbf{E}[X \mid Y = y]\, f_Y(y)\, dy, & Y \text{ continuous.} \end{cases}
\]

Both expressions in the right-hand side are familiar from Chapters 2 and 3, respectively. 
By the corresponding versions of the total expectation theorem, they 
are equal to E[X]. This brings us to the following conclusion, which is actually 
valid for every type of random variable Y (discrete, continuous, or mixed), as 
long as X has a well-defined and finite expectation E[X]. 


Law of Iterated Expectations: E[E[X | Y]] = E[X]. 





The following examples illustrate how the law of iterated expectations fa- 
cilitates the calculation of expected values when the problem data include con- 
ditional probabilities. 


Example 4.16 (continued). Suppose that Y, the probability of heads for our 
coin, is uniformly distributed over the interval [0, 1]. Since E[X | Y] = nY and 
E[Y] = 1/2, by the law of iterated expectations, we have 

\[
\mathbf{E}[X] = \mathbf{E}\big[ \mathbf{E}[X \mid Y] \big] = \mathbf{E}[nY] = n\,\mathbf{E}[Y] = \frac{n}{2}.
\]


Example 4.17. We start with a stick of length $\ell$. We break it at a point which 
is chosen randomly and uniformly over its length, and keep the piece that contains 
the left end of the stick. We then repeat the same process on the piece that we 
were left with. What is the expected length of the piece that we are left with after 
breaking twice? 

Let Y be the length of the piece after we break for the first time. Let X be 
the length after we break for the second time. We have E[X | Y] = Y/2, since the 
breakpoint is chosen uniformly over a piece of length Y. For a similar reason, we 
also have $\mathbf{E}[Y] = \ell/2$. Thus, 

\[
\mathbf{E}[X] = \mathbf{E}\big[ \mathbf{E}[X \mid Y] \big] = \mathbf{E}\left[ \frac{Y}{2} \right] = \frac{\mathbf{E}[Y]}{2} = \frac{\ell}{4}.
\]
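
The answer of Example 4.17 can be compared with a Monte Carlo estimate. The sketch below simulates the two breaks directly; the stick length, the number of trials, the seed, and the function name are arbitrary illustrative choices, and the estimate is only approximate.

```python
import random

def expected_length_after_two_breaks(stick_length, trials=200_000, seed=0):
    """Monte Carlo estimate of E[X]: break uniformly, keep the left piece, repeat once."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = rng.uniform(0.0, stick_length)   # length kept after the first break
        x = rng.uniform(0.0, y)              # length kept after the second break
        total += x
    return total / trials

estimate = expected_length_after_two_breaks(8.0)
print(estimate)   # the law of iterated expectations gives 8/4 = 2
```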


Example 4.18. Averaging Quiz Scores by Section. A class has n students 
and the quiz score of student i is $x_i$. The average quiz score is 

\[
m = \frac{1}{n} \sum_{i=1}^n x_i.
\]

The students are divided into k disjoint subsets $A_1, \ldots, A_k$, and are accordingly 
assigned to different sections. We use $n_s$ to denote the number of students in 
section s. The average score in section s is 

\[
m_s = \frac{1}{n_s} \sum_{i \in A_s} x_i.
\]

The average score over the whole class can be computed by taking the average score 
$m_s$ of each section, and then forming a weighted average; the weight given to section 
s is proportional to the number of students in that section, and is $n_s/n$. We verify 
that this gives the correct result: 

\[
\begin{aligned}
\sum_{s=1}^k \frac{n_s}{n}\, m_s &= \sum_{s=1}^k \frac{n_s}{n} \cdot \frac{1}{n_s} \sum_{i \in A_s} x_i \\
&= \frac{1}{n} \sum_{s=1}^k \sum_{i \in A_s} x_i \\
&= \frac{1}{n} \sum_{i=1}^n x_i \\
&= m.
\end{aligned}
\]
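
The weighted-average identity is easy to confirm on a small data set. The scores and section assignments below are hypothetical, invented only to illustrate the calculation; exact rational arithmetic avoids any rounding.

```python
from fractions import Fraction

# Hypothetical quiz scores grouped by section (illustrative data only).
sections = {1: [70, 85, 90], 2: [60, 75], 3: [88, 92, 79, 81]}

all_scores = [score for scores in sections.values() for score in scores]
n = len(all_scores)
overall_mean = Fraction(sum(all_scores), n)                  # m

section_means = {s: Fraction(sum(scores), len(scores))       # m_s
                 for s, scores in sections.items()}
weighted_mean = sum(Fraction(len(scores), n) * section_means[s]   # sum of (n_s/n) m_s
                    for s, scores in sections.items())

print(overall_mean, weighted_mean)
```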


How is this related to conditional expectations? Consider an experiment in 
which a student is selected at random, each student having probability 1/n of being 
selected. Consider the following two random variables: 


X = quiz score of a student, 
Y = section of a student, $(Y \in \{1, \ldots, k\})$. 
We then have 

\[
\mathbf{E}[X] = m.
\]

Conditioning on Y = s is the same as assuming that the selected student is 
in section s. Conditional on that event, every student in that section has the same 
probability $1/n_s$ of being chosen. Therefore, 

\[
\mathbf{E}[X \mid Y = s] = \frac{1}{n_s} \sum_{i \in A_s} x_i = m_s.
\]

A randomly selected student belongs to section s with probability $n_s/n$, i.e., $\mathbf{P}(Y = s) = n_s/n$. Hence, 

\[
m = \mathbf{E}[X] = \mathbf{E}\big[ \mathbf{E}[X \mid Y] \big] = \sum_{s=1}^k \mathbf{E}[X \mid Y = s]\, \mathbf{P}(Y = s) = \sum_{s=1}^k \frac{n_s}{n}\, m_s.
\]


Thus, averaging by section can be viewed as a special case of the law of iterated 
expectations. 
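The identity in Example 4.18 can be checked on a small data set. The sketch below is only an illustration: the section labels and scores are made up, and the variable names are not from the text.

```python
from collections import defaultdict

# Hypothetical data: (section label, quiz score) for each student.
students = [("A", 70), ("A", 90), ("B", 60), ("B", 80), ("B", 100), ("C", 50)]

n = len(students)
overall_mean = sum(score for _, score in students) / n  # m = E[X]

# Group the scores by section.
by_section = defaultdict(list)
for section, score in students:
    by_section[section].append(score)

# E[E[X | Y]]: weight each section mean m_s by P(Y = s) = n_s / n.
iterated_mean = sum(
    (len(scores) / n) * (sum(scores) / len(scores))
    for scores in by_section.values()
)

print(overall_mean, iterated_mean)
```

The two printed numbers should agree up to floating-point rounding, whatever scores are used.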
Example 4.19. Forecast Revisions. Let Y be the sales of a company in the
first semester of the coming year, and let X be the sales over the entire year. The 
company has constructed a statistical model of sales, and so the joint distribution 
of X and Y is assumed to be known. In the beginning of the year, the expected 
value E[X] serves as a forecast of the actual sales X. In the middle of the year,
the first semester sales have been realized and the value of the random variable Y 
is now known. This places us in a new “universe,” where everything is conditioned 
on the realized value of Y. Based on the knowledge of Y, the company constructs 
a revised forecast of yearly sales, which is E[X |Y]. 

We view E[X | Y] - E[X] as the forecast revision, in light of the mid-year
information. The law of iterated expectations implies that 


E[E[X | Y] - E[X]] = E[E[X | Y]] - E[X] = E[X] - E[X] = 0. 
This indicates that while the actual revision will usually be nonzero, in the beginning
of the year we expect the revision to be zero, on the average. This is quite intuitive.
Indeed, if the expected revision were positive, the original forecast should have been
higher in the first place.


We finally note an important property: for any function g, we have

\[
E[X g(Y) \mid Y] = g(Y)\, E[X \mid Y].
\]


This is because given the value of Y , g(Y ) is a constant and can be pulled outside 
the expectation; see also Problem 25. 


The Conditional Expectation as an Estimator 


If we view Y as an observation that provides information about X, it is natural
to view the conditional expectation, denoted

\[
\hat{X} = E[X \mid Y],
\]

as an estimator of X given Y. The estimation error

\[
\tilde{X} = X - \hat{X}
\]

is a random variable satisfying

\[
E[\tilde{X} \mid Y] = E[X - \hat{X} \mid Y] = E[X \mid Y] - E[\hat{X} \mid Y] = \hat{X} - \hat{X} = 0.
\]

Thus, the random variable $E[\tilde{X} \mid Y]$ is identically zero: $E[\tilde{X} \mid Y = y] = 0$ for all
values of y. By using the law of iterated expectations, we also have

\[
E[\tilde{X}] = E\big[E[\tilde{X} \mid Y]\big] = 0.
\]


This property is reassuring, as it indicates that the estimation error does not
have a systematic upward or downward bias.

We will now show that $\hat{X}$ has another interesting property: it is uncorrelated
with the estimation error $\tilde{X}$. Indeed, using the law of iterated
expectations, we have

\[
E[\hat{X}\tilde{X}] = E\big[E[\hat{X}\tilde{X} \mid Y]\big] = E\big[\hat{X}\, E[\tilde{X} \mid Y]\big] = 0,
\]

where the last two equalities follow from the fact that $\hat{X}$ is completely determined
by Y, so that

\[
E[\hat{X}\tilde{X} \mid Y] = \hat{X}\, E[\tilde{X} \mid Y] = 0.
\]

It follows that

\[
\mathrm{cov}(\hat{X}, \tilde{X}) = E[\hat{X}\tilde{X}] - E[\hat{X}]\, E[\tilde{X}] = 0 - E[X] \cdot 0 = 0,
\]

and $\hat{X}$ and $\tilde{X}$ are uncorrelated.

An important consequence of the fact $\mathrm{cov}(\hat{X}, \tilde{X}) = 0$ is that by considering
the variance of both sides in the equation $X = \hat{X} + \tilde{X}$, we obtain

\[
\mathrm{var}(X) = \mathrm{var}(\hat{X}) + \mathrm{var}(\tilde{X}).
\]

This relation can be written in the form of a useful law, as we now discuss.
The Conditional Variance

We introduce the random variable

\[
\mathrm{var}(X \mid Y) = E\big[(X - E[X \mid Y])^2 \mid Y\big] = E[\tilde{X}^2 \mid Y].
\]

This is the function of Y whose value is the conditional variance of X when Y
takes the value y:

\[
\mathrm{var}(X \mid Y = y) = E[\tilde{X}^2 \mid Y = y].
\]

Using the fact $E[\tilde{X}] = 0$ and the law of iterated expectations, we can write the
variance of the estimation error as

\[
\mathrm{var}(\tilde{X}) = E[\tilde{X}^2] = E\big[E[\tilde{X}^2 \mid Y]\big] = E[\mathrm{var}(X \mid Y)],
\]

and rewrite the equation $\mathrm{var}(X) = \mathrm{var}(\hat{X}) + \mathrm{var}(\tilde{X})$ as follows.


Law of Total Variance: var(X) = E[var(X | Y)] + var(E[X | Y]). 





The law of total variance is helpful in calculating variances of random
variables by using conditioning, as illustrated by the following examples.

Example 4.16 (continued). We consider n independent tosses of a biased coin
whose probability of heads, Y, is uniformly distributed over the interval [0, 1]. With
X being the number of heads obtained, we have E[X | Y] = nY and var(X | Y) =
nY(1 - Y). Thus,

\[
E[\mathrm{var}(X \mid Y)] = E[nY(1 - Y)] = n\big(E[Y] - E[Y^2]\big)
= n\big(E[Y] - \mathrm{var}(Y) - (E[Y])^2\big) = n\left(\frac{1}{2} - \frac{1}{12} - \frac{1}{4}\right) = \frac{n}{6}.
\]

Furthermore,

\[
\mathrm{var}(E[X \mid Y]) = \mathrm{var}(nY) = \frac{n^2}{12}.
\]

Therefore, by the law of total variance, we have

\[
\mathrm{var}(X) = E[\mathrm{var}(X \mid Y)] + \mathrm{var}(E[X \mid Y]) = \frac{n}{6} + \frac{n^2}{12}.
\]


Example 4.17 (continued). Consider again the problem where we break twice a
stick of length ℓ at randomly chosen points. Here Y is the length of the piece left
after the first break and X is the length after the second break. We calculated the
mean of X as ℓ/4. We will now use the law of total variance to calculate var(X).
Since X is uniformly distributed between 0 and Y, we have

\[
\mathrm{var}(X \mid Y) = \frac{Y^2}{12}.
\]

Thus, since Y is uniformly distributed between 0 and ℓ, we have

\[
E[\mathrm{var}(X \mid Y)] = \frac{1}{12} \int_0^{\ell} \frac{1}{\ell}\, y^2 \, dy
= \frac{1}{12} \cdot \frac{1}{3\ell}\, y^3 \Big|_0^{\ell} = \frac{\ell^2}{36}.
\]

We also have E[X | Y] = Y/2, so

\[
\mathrm{var}(E[X \mid Y]) = \mathrm{var}(Y/2) = \frac{1}{4}\, \mathrm{var}(Y) = \frac{1}{4} \cdot \frac{\ell^2}{12} = \frac{\ell^2}{48}.
\]

Using now the law of total variance, we obtain

\[
\mathrm{var}(X) = E[\mathrm{var}(X \mid Y)] + \mathrm{var}(E[X \mid Y]) = \frac{\ell^2}{36} + \frac{\ell^2}{48} = \frac{7\ell^2}{144}.
\]
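The stick-breaking results can be checked by simulation. The following is a rough Monte Carlo sketch, not part of the original derivation; the stick length, seed, and number of trials are arbitrary choices made for illustration.

```python
import random

random.seed(0)
ell = 1.0          # stick length (assumed value for the simulation)
trials = 200_000

samples = []
for _ in range(trials):
    y = random.uniform(0.0, ell)   # piece kept after the first break
    x = random.uniform(0.0, y)     # piece kept after the second break
    samples.append(x)

mean_x = sum(samples) / trials
var_x = sum((x - mean_x) ** 2 for x in samples) / trials

print(mean_x, ell / 4)                 # sample mean vs. ell/4
print(var_x, 7 * ell ** 2 / 144)       # sample variance vs. 7*ell^2/144
```

With this many trials the sample values should land close to ℓ/4 and 7ℓ²/144, though they will not match exactly.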


Example 4.20. Averaging Quiz Scores by Section — Variance. The setting 
is the same as in Example 4.18 and we consider again the random variables 


X = quiz score of a student, 


Y = section of a student, (Y ∈ {1, ..., k}).

Let $n_s$ be the number of students in section s, and let n be the total number of
students. We interpret the different quantities in the formula

\[
\mathrm{var}(X) = E[\mathrm{var}(X \mid Y)] + \mathrm{var}(E[X \mid Y]).
\]

Here, var(X | Y = s) is the variance of the quiz scores within section s. Thus,

\[
E[\mathrm{var}(X \mid Y)] = \sum_{s=1}^{k} P(Y = s)\, \mathrm{var}(X \mid Y = s) = \sum_{s=1}^{k} \frac{n_s}{n}\, \mathrm{var}(X \mid Y = s),
\]

so that E[var(X | Y)] is the weighted average of the section variances, where each
section is weighted in proportion to its size.

Recall that E[X | Y = s] is the average score in section s. Then, var(E[X | Y])
is a measure of the variability of the averages of the different sections. The law of
total variance states that the total quiz score variance can be broken into two parts:

(a) The average score variability E[var(X | Y)] within individual sections.

(b) The variability var(E[X | Y]) between sections.


We have seen earlier that the law of iterated expectations can be used to 
break down complicated expectation calculations, by considering different cases. 
A similar method applies to variance calculations. 


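Example 4.21. Computing Variances by Conditioning. Consider a continuous
random variable X with the PDF given in Fig. 4.13. We define an auxiliary
random variable Y as follows:

\[
Y = \begin{cases} 1, & \text{if } x < 1, \\ 2, & \text{if } x \ge 1. \end{cases}
\]

Here, E[X | Y] takes the values 1/2 and 2, each with probability 1/2. Thus, the
mean of E[X | Y] is 5/4. It follows that

\[
\mathrm{var}(E[X \mid Y]) = \frac{1}{2}\left(\frac{1}{2} - \frac{5}{4}\right)^2 + \frac{1}{2}\left(2 - \frac{5}{4}\right)^2 = \frac{9}{16}.
\]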





Figure 4.13: The PDF in Example 4.21.

Conditioned on Y = 1 or Y = 2, X is uniformly distributed on an interval of
length 1 or 2, respectively. Therefore,

\[
\mathrm{var}(X \mid Y = 1) = \frac{1}{12}, \qquad \mathrm{var}(X \mid Y = 2) = \frac{4}{12},
\]

and

\[
E[\mathrm{var}(X \mid Y)] = \frac{1}{2} \cdot \frac{1}{12} + \frac{1}{2} \cdot \frac{4}{12} = \frac{5}{24}.
\]

Putting everything together, we obtain

\[
\mathrm{var}(X) = E[\mathrm{var}(X \mid Y)] + \mathrm{var}(E[X \mid Y]) = \frac{5}{24} + \frac{9}{16} = \frac{37}{48}.
\]
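The result of Example 4.21 can be cross-checked by computing the variance directly from the PDF. The figure itself is not reproduced here, so the sketch below assumes a PDF consistent with the conditional means and interval lengths quoted in the example: density 1/2 on [0, 1] and 1/4 on [1, 3]. The helper `moment` is a name introduced only for this illustration.

```python
from fractions import Fraction as F

# Assumed reading of Fig. 4.13: f_X(x) = 1/2 on [0, 1] and 1/4 on [1, 3].
pieces = [(F(0), F(1), F(1, 2)), (F(1), F(3), F(1, 4))]  # (left, right, density)

def moment(k):
    """Integral of x^k * f_X(x), evaluated exactly piece by piece."""
    return sum(d * (b ** (k + 1) - a ** (k + 1)) / (k + 1) for a, b, d in pieces)

direct_var = moment(2) - moment(1) ** 2

# Law of total variance, conditioning on the piece Y that X falls in.
probs = [d * (b - a) for a, b, d in pieces]             # P(Y = 1), P(Y = 2)
cond_means = [(a + b) / 2 for a, b, _ in pieces]        # E[X | Y]
cond_vars = [(b - a) ** 2 / 12 for a, b, _ in pieces]   # var(X | Y)

mean = sum(p * m for p, m in zip(probs, cond_means))
expected_cond_var = sum(p * v for p, v in zip(probs, cond_vars))
var_of_cond_mean = sum(p * (m - mean) ** 2 for p, m in zip(probs, cond_means))

print(direct_var, expected_cond_var + var_of_cond_mean)
```

Under the assumed PDF, both routes give 37/48 exactly, since the arithmetic is done with rational numbers.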


We summarize the main points in this section. 


Properties of the Conditional Expectation and Variance 


E[X |Y = y] is a number whose value depends on y. 


E[X |Y] is a function of the random variable Y, hence a random vari- 
able. Its value is E[X | Y = y] whenever the value of Y is y. 


E[E[X | Y]] = E[X] (law of iterated expectations).


E[X | Y = y] may be viewed as an estimate of X given Y = y. The
corresponding error E[X | Y] - X is a zero mean random variable that
is uncorrelated with E[X | Y].


var(X | Y) is a random variable whose value is var(X | Y = y) whenever
the value of Y is y.


var(X) = E[var(X | Y)] + var(E[X | Y]) (law of total variance).





4.4 TRANSFORMS


In this section, we introduce the transform associated with a random variable. 
The transform provides us with an alternative representation of a probability 
law. It is not particularly intuitive, but it is often convenient for certain types 
of mathematical manipulations. 

The transform associated with a random variable X (also referred to as
the associated moment generating function) is a function $M_X(s)$ of a scalar
parameter s, defined by

\[
M_X(s) = E[e^{sX}].
\]

The simpler notation M(s) can also be used whenever the underlying random
variable X is clear from the context. In more detail, when X is a discrete random
variable, the corresponding transform is given by

\[
M(s) = \sum_{x} e^{sx} p_X(x),
\]

while in the continuous case it is given by†

\[
M(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx.
\]


Let us now provide some examples of transforms. 


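Example 4.22. Let

\[
p_X(x) = \begin{cases} 1/2, & \text{if } x = 2, \\ 1/6, & \text{if } x = 3, \\ 1/3, & \text{if } x = 5. \end{cases}
\]

The corresponding transform is

\[
M(s) = E[e^{sX}] = \frac{1}{2} e^{2s} + \frac{1}{6} e^{3s} + \frac{1}{3} e^{5s}.
\]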


Example 4.23. The Transform Associated with a Poisson Random Variable.
Let X be a Poisson random variable with parameter λ:

\[
p_X(x) = \frac{\lambda^x e^{-\lambda}}{x!}, \qquad x = 0, 1, 2, \ldots
\]

The corresponding transform is

\[
M(s) = \sum_{x=0}^{\infty} e^{sx}\, \frac{\lambda^x e^{-\lambda}}{x!}.
\]

We let $a = e^s \lambda$ and obtain

\[
M(s) = e^{-\lambda} \sum_{x=0}^{\infty} \frac{a^x}{x!} = e^{-\lambda} e^{a} = e^{a - \lambda} = e^{\lambda(e^s - 1)}.
\]

† The reader who is familiar with Laplace transforms may recognize that the
transform associated with a continuous random variable is essentially the same as the
Laplace transform of its PDF, the only difference being that Laplace transforms usually
involve $e^{-sx}$ rather than $e^{sx}$. For the discrete case, a variable z is sometimes used in
place of $e^s$ and the resulting transform

\[
M(z) = \sum_{x} z^x p_X(x)
\]

is known as the z-transform. However, we will not be using z-transforms in this book.


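Example 4.24. The Transform Associated with an Exponential Random
Variable. Let X be an exponential random variable with parameter λ:

\[
f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0.
\]

Then,

\[
M(s) = \lambda \int_0^{\infty} e^{sx} e^{-\lambda x}\, dx
= \lambda \int_0^{\infty} e^{(s - \lambda)x}\, dx
= \lambda\, \frac{e^{(s - \lambda)x}}{s - \lambda} \Big|_0^{\infty}
= \frac{\lambda}{\lambda - s} \qquad (\text{if } s < \lambda).
\]

The above calculation and the formula for M(s) are correct only if the integrand
$e^{(s - \lambda)x}$ decays as x increases, which is the case if and only if s < λ; otherwise, the
integral is infinite.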
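The closed form λ/(λ - s) can be compared with a direct numerical evaluation of the defining integral. The sketch below uses a plain midpoint rule on a truncated range; the parameter values, the truncation point, and the function name are assumptions made for the illustration.

```python
import math

def exp_transform_numeric(lam, s, upper=60.0, steps=200_000):
    """Midpoint-rule approximation of the integral of e^{sx} * lam * e^{-lam x} on [0, upper]."""
    h = upper / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += math.exp((s - lam) * x) * lam
    return total * h

lam, s = 2.0, 0.5   # assumed values with s < lam, so the integrand decays
numeric = exp_transform_numeric(lam, s)
closed_form = lam / (lam - s)

print(numeric, closed_form)
```

Choosing s at or above λ makes the integrand grow instead of decay, and the truncated integral then keeps increasing with `upper`, in line with the remark that the integral is infinite in that case.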


It is important to realize that the transform is not a number but rather a
function of a parameter s. Thus, we are dealing with a transformation that starts
with a function, e.g., a PDF, and results in a new function. Strictly speaking,
M(s) is only defined for those values of s for which $E[e^{sX}]$ is finite, as noted in
the preceding example.


Example 4.25. The Transform Associated with a Linear Function of
a Random Variable. Let $M_X(s)$ be the transform associated with a random
variable X. Consider a new random variable Y = aX + b. We then have

\[
M_Y(s) = E[e^{s(aX + b)}] = e^{sb} E[e^{saX}] = e^{sb} M_X(sa).
\]

For example, if X is exponential with parameter λ = 1, so that $M_X(s) = 1/(1 - s)$,
and if Y = 2X + 3, then

\[
M_Y(s) = e^{3s}\, \frac{1}{1 - 2s}.
\]


Example 4.26. The Transform Associated with a Normal Random Variable.
Let X be a normal random variable with mean μ and variance σ². To calculate
the corresponding transform, we first consider the special case of the standard
normal random variable Y, where μ = 0 and σ² = 1, and then use the formula
derived in the preceding example. The PDF of the standard normal is

\[
f_Y(y) = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2},
\]

and the associated transform is

\[
\begin{aligned}
M_Y(s) &= \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2} e^{sy}\, dy \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y^2/2) + sy}\, dy \\
&= e^{s^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y^2/2) + sy - (s^2/2)}\, dy \\
&= e^{s^2/2}\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(y - s)^2/2}\, dy \\
&= e^{s^2/2},
\end{aligned}
\]

where the last equality follows by using the normalization property of a normal
PDF with mean s and unit variance.

A general normal random variable with mean μ and variance σ² is obtained
from the standard normal via the linear transformation

\[
X = \sigma Y + \mu.
\]

The transform associated with the standard normal is $M_Y(s) = e^{s^2/2}$, as verified
above. By applying the formula of Example 4.25, we obtain

\[
M_X(s) = e^{s\mu} M_Y(s\sigma) = e^{(\sigma^2 s^2/2) + \mu s}.
\]


From Transforms to Moments 


The reason behind the alternative name “moment generating function” is that
the moments of a random variable are easily computed once a formula for the
associated transform is available. To see this, let us consider a continuous random
variable X, and let us take the derivative of both sides of the definition

\[
M(s) = \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx,
\]

with respect to s. We obtain

\[
\frac{d}{ds} M(s) = \frac{d}{ds} \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx
= \int_{-\infty}^{\infty} \frac{d}{ds}\, e^{sx} f_X(x)\, dx
= \int_{-\infty}^{\infty} x e^{sx} f_X(x)\, dx.
\]

This equality holds for all values of s.† By considering the special case where
s = 0, we obtain

\[
\frac{d}{ds} M(s) \Big|_{s=0} = \int_{-\infty}^{\infty} x f_X(x)\, dx = E[X].
\]

More generally, if we differentiate n times the function M(s) with respect to s,
a similar calculation yields

\[
\frac{d^n}{ds^n} M(s) \Big|_{s=0} = \int_{-\infty}^{\infty} x^n f_X(x)\, dx = E[X^n].
\]


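Example 4.27. We saw earlier (Example 4.22) that the PMF

\[
p_X(x) = \begin{cases} 1/2, & \text{if } x = 2, \\ 1/6, & \text{if } x = 3, \\ 1/3, & \text{if } x = 5, \end{cases}
\]

is associated with the transform

\[
M(s) = \frac{1}{2} e^{2s} + \frac{1}{6} e^{3s} + \frac{1}{3} e^{5s}.
\]

Thus,

\[
\begin{aligned}
E[X] &= \frac{d}{ds} M(s) \Big|_{s=0} \\
&= \frac{1}{2} \cdot 2 e^{2s} + \frac{1}{6} \cdot 3 e^{3s} + \frac{1}{3} \cdot 5 e^{5s} \Big|_{s=0} \\
&= \frac{1}{2} \cdot 2 + \frac{1}{6} \cdot 3 + \frac{1}{3} \cdot 5 \\
&= \frac{19}{6}.
\end{aligned}
\]

Also,

\[
\begin{aligned}
E[X^2] &= \frac{d^2}{ds^2} M(s) \Big|_{s=0} \\
&= \frac{1}{2} \cdot 4 e^{2s} + \frac{1}{6} \cdot 9 e^{3s} + \frac{1}{3} \cdot 25 e^{5s} \Big|_{s=0} \\
&= \frac{1}{2} \cdot 4 + \frac{1}{6} \cdot 9 + \frac{1}{3} \cdot 25 \\
&= \frac{71}{6}.
\end{aligned}
\]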
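The moment calculations in Example 4.27 can be sanity-checked by differentiating the transform numerically. The sketch below uses central finite differences at s = 0; the step size `h` is an arbitrary choice, and the approximation is only as accurate as that step allows.

```python
import math

pmf = {2: 1 / 2, 3: 1 / 6, 5: 1 / 3}   # PMF of Example 4.27

def M(s):
    """Transform M(s) = sum over x of e^{sx} p_X(x)."""
    return sum(p * math.exp(s * x) for x, p in pmf.items())

h = 1e-4
first = (M(h) - M(-h)) / (2 * h)              # approximates dM/ds at s = 0
second = (M(h) - 2 * M(0) + M(-h)) / h ** 2   # approximates d^2M/ds^2 at s = 0

print(first, 19 / 6)
print(second, 71 / 6)
```

The finite-difference values should agree with 19/6 and 71/6 to several decimal places.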


† This derivation involves an interchange of differentiation and integration. The
interchange turns out to be justified for all of the applications to be considered in
this book. Furthermore, the derivation remains valid for general random variables,
including discrete ones. In fact, it could be carried out more abstractly, in the form

\[
\frac{d}{ds} M(s) = \frac{d}{ds} E[e^{sX}] = E\left[\frac{d}{ds}\, e^{sX}\right] = E[X e^{sX}],
\]

leading to the same conclusion.

For an exponential random variable with PDF

\[
f_X(x) = \lambda e^{-\lambda x}, \qquad x \ge 0,
\]

we found earlier (Example 4.24) that

\[
M(s) = \frac{\lambda}{\lambda - s}.
\]

Thus,

\[
\frac{d}{ds} M(s) = \frac{\lambda}{(\lambda - s)^2}, \qquad \frac{d^2}{ds^2} M(s) = \frac{2\lambda}{(\lambda - s)^3}.
\]

By setting s = 0, we obtain

\[
E[X] = \frac{1}{\lambda}, \qquad E[X^2] = \frac{2}{\lambda^2},
\]

which agrees with the formulas derived in Chapter 3.


We close by noting two more useful and generic properties of transforms.
For any random variable X, we have

\[
M_X(0) = E[e^{0X}] = E[1] = 1,
\]

and if X takes only nonnegative integer values, then

\[
\lim_{s \to -\infty} M_X(s) = P(X = 0)
\]

(see the end-of-chapter problems).
Inversion of Transforms

A very important property of the transform $M_X(s)$ is that it can be inverted,
i.e., it can be used to determine the probability law of the random variable X.
Some appropriate mathematical conditions are required, which are satisfied in
all of our examples that make use of the inversion property. The following is a
more precise statement. Its proof is beyond our scope.


Inversion Property 


The transform $M_X(s)$ associated with a random variable X uniquely determines
the CDF of X, assuming that $M_X(s)$ is finite for all s in some interval
[-a, a], where a is a positive number.





There exist explicit formulas that allow us to recover the PMF or PDF of a
random variable starting from the associated transform, but they are quite difficult
to use. In practice, transforms are usually inverted by "pattern matching,"
based on tables of known distribution-transform pairs. We will see a number of
such examples shortly.


Example 4.28. We are told that the transform associated with a random variable
X is

\[
M(s) = \frac{1}{4} e^{-s} + \frac{1}{2} + \frac{1}{8} e^{4s} + \frac{1}{8} e^{5s}.
\]

Since M(s) is a sum of terms of the form $e^{sx}$, we can compare with the general
formula

\[
M(s) = \sum_{x} e^{sx} p_X(x),
\]

and infer that X is a discrete random variable. The different values that X can
take can be read from the corresponding exponents, and are -1, 0, 4, and 5. The
probability of each value x is given by the coefficient multiplying the corresponding
$e^{sx}$ term. In our case,

\[
P(X = -1) = \frac{1}{4}, \qquad P(X = 0) = \frac{1}{2}, \qquad P(X = 4) = \frac{1}{8}, \qquad P(X = 5) = \frac{1}{8}.
\]

Generalizing from the last example, the distribution of a finite-valued discrete
random variable can always be found by inspection of the corresponding
transform. The same procedure also works for discrete random variables with
an infinite range, as in the example that follows.


Example 4.29. The Transform Associated with a Geometric Random
Variable. We are told that the transform associated with a random variable X is
of the form

\[
M(s) = \frac{p e^{s}}{1 - (1 - p) e^{s}},
\]

where p is a constant in the range 0 < p ≤ 1. We wish to find the distribution of
X. We recall the formula for the geometric series:

\[
\frac{1}{1 - \alpha} = 1 + \alpha + \alpha^2 + \cdots,
\]

which is valid whenever |α| < 1. We use this formula with $\alpha = (1 - p)e^{s}$, and for s
sufficiently close to zero so that $(1 - p)e^{s} < 1$. We obtain

\[
M(s) = p e^{s} \big( 1 + (1 - p)e^{s} + (1 - p)^2 e^{2s} + (1 - p)^3 e^{3s} + \cdots \big).
\]

As in the previous example, we infer that this is a discrete random variable that
takes positive integer values. The probability P(X = k) is found by reading the
coefficient of the term $e^{ks}$. In particular, P(X = 1) = p, P(X = 2) = p(1 - p), and

\[
P(X = k) = p(1 - p)^{k-1}, \qquad k = 1, 2, \ldots
\]

We recognize this as the geometric distribution with parameter p.

Note that

\[
\frac{d}{ds} M(s) = \frac{p e^{s}}{1 - (1 - p)e^{s}} + \frac{(1 - p) p e^{2s}}{\big(1 - (1 - p)e^{s}\big)^2}.
\]

For s = 0, the right-hand side is equal to 1/p, which agrees with the formula for
E[X] derived in Chapter 2.
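The pattern-matching argument of Example 4.29 can be checked in the forward direction: summing $e^{sk} p (1 - p)^{k-1}$ over k should reproduce the closed-form transform. The sketch below does this with a truncated series; the values of p and s, the number of terms, and the function names are illustrative assumptions.

```python
import math

def geometric_transform_series(p, s, terms=500):
    """Partial sum of sum_{k>=1} e^{sk} * p * (1-p)^(k-1)."""
    return sum(math.exp(s * k) * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))

def geometric_transform_closed(p, s):
    """Closed form p e^s / (1 - (1 - p) e^s), valid when (1 - p) e^s < 1."""
    return p * math.exp(s) / (1 - (1 - p) * math.exp(s))

p, s = 0.3, 0.1   # assumed values; here (1 - p) * e^s is about 0.77 < 1
series_value = geometric_transform_series(p, s)
closed_value = geometric_transform_closed(p, s)

print(series_value, closed_value)
```

If s is increased until $(1 - p)e^{s} \ge 1$, the partial sums stop converging, which matches the restriction on s stated in the example.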


Example 4.30. The Transform Associated with a Mixture of Two Distributions.
The neighborhood bank has three tellers, two of them fast, one slow.
The time to assist a customer is exponentially distributed with parameter λ = 6 at
the fast tellers, and λ = 4 at the slow teller. Jane enters the bank and chooses a
teller at random, each one with probability 1/3. Find the PDF of the time it takes
to assist Jane and the associated transform.

We have

\[
f_X(x) = \frac{2}{3} \cdot 6 e^{-6x} + \frac{1}{3} \cdot 4 e^{-4x}, \qquad x \ge 0.
\]

Then,

\[
\begin{aligned}
M(s) &= \int_0^{\infty} e^{sx} \left( \frac{2}{3}\, 6 e^{-6x} + \frac{1}{3}\, 4 e^{-4x} \right) dx \\
&= \frac{2}{3} \int_0^{\infty} e^{sx}\, 6 e^{-6x}\, dx + \frac{1}{3} \int_0^{\infty} e^{sx}\, 4 e^{-4x}\, dx \\
&= \frac{2}{3} \cdot \frac{6}{6 - s} + \frac{1}{3} \cdot \frac{4}{4 - s} \qquad (\text{for } s < 4).
\end{aligned}
\]
More generally, let $X_1, \ldots, X_n$ be continuous random variables with PDFs
$f_{X_1}, \ldots, f_{X_n}$. The value y of a random variable Y is generated as follows: an index
i is chosen with a corresponding probability $p_i$, and y is taken to be equal to the
value of $X_i$. Then,

\[
f_Y(y) = p_1 f_{X_1}(y) + \cdots + p_n f_{X_n}(y),
\]

and

\[
M_Y(s) = p_1 M_{X_1}(s) + \cdots + p_n M_{X_n}(s).
\]


The steps in this problem can be reversed. For example, we may be given
that the transform associated with a random variable Y is of the form

\[
\frac{1}{2} \cdot \frac{1}{2 - s} + \frac{3}{4} \cdot \frac{1}{1 - s}.
\]

We can then rewrite it as

\[
\frac{1}{4} \cdot \frac{2}{2 - s} + \frac{3}{4} \cdot \frac{1}{1 - s},
\]

and recognize that Y is the mixture of two exponential random variables with
parameters 2 and 1, which are selected with probabilities 1/4 and 3/4, respectively.


Sums of Independent Random Variables


Transform methods are particularly convenient when dealing with a sum of random
variables. The reason is that addition of independent random variables
corresponds to multiplication of transforms, as we will proceed to show. This
provides an often convenient alternative to the convolution formula.

Let X and Y be independent random variables, and let Z = X + Y. The
transform associated with Z is, by definition,

\[
M_Z(s) = E[e^{sZ}] = E[e^{s(X + Y)}] = E[e^{sX} e^{sY}].
\]

Since X and Y are independent, $e^{sX}$ and $e^{sY}$ are independent random variables,
for any fixed value of s. Hence, the expectation of their product is the product
of the expectations, and

\[
M_Z(s) = E[e^{sX}]\, E[e^{sY}] = M_X(s)\, M_Y(s).
\]

By the same argument, if $X_1, \ldots, X_n$ is a collection of independent random
variables, and

\[
Z = X_1 + \cdots + X_n,
\]

then

\[
M_Z(s) = M_{X_1}(s) \cdots M_{X_n}(s).
\]


Example 4.31. The Transform Associated with the Binomial. Let $X_1, \ldots, X_n$
be independent Bernoulli random variables with a common parameter p. Then,

\[
M_{X_i}(s) = (1 - p)e^{0s} + p e^{1s} = 1 - p + p e^{s}, \qquad \text{for all } i.
\]

The random variable $Z = X_1 + \cdots + X_n$ is binomial with parameters n and p. The
corresponding transform is given by

\[
M_Z(s) = (1 - p + p e^{s})^n.
\]


Example 4.32. The Sum of Independent Poisson Random Variables is
Poisson. Let X and Y be independent Poisson random variables with means λ
and μ, respectively, and let Z = X + Y. Then,

\[
M_X(s) = e^{\lambda(e^s - 1)}, \qquad M_Y(s) = e^{\mu(e^s - 1)},
\]

and

\[
M_Z(s) = M_X(s)\, M_Y(s) = e^{\lambda(e^s - 1)}\, e^{\mu(e^s - 1)} = e^{(\lambda + \mu)(e^s - 1)}.
\]

Thus, the transform associated with Z is the same as the transform associated
with a Poisson random variable with mean λ + μ. By the uniqueness property of
transforms, Z is Poisson with mean λ + μ.


Example 4.33. The Sum of Independent Normal Random Variables is
Normal. Let X and Y be independent normal random variables with means $\mu_x$,
$\mu_y$, and variances $\sigma_x^2$, $\sigma_y^2$, respectively, and let Z = X + Y. Then,

\[
M_X(s) = \exp\left\{ \frac{\sigma_x^2 s^2}{2} + \mu_x s \right\}, \qquad
M_Y(s) = \exp\left\{ \frac{\sigma_y^2 s^2}{2} + \mu_y s \right\},
\]

and

\[
M_Z(s) = \exp\left\{ \frac{(\sigma_x^2 + \sigma_y^2) s^2}{2} + (\mu_x + \mu_y) s \right\}.
\]

We observe that the transform associated with Z is the same as the transform
associated with a normal random variable with mean $\mu_x + \mu_y$ and variance $\sigma_x^2 + \sigma_y^2$.
By the uniqueness property of transforms, Z is normal with these parameters, thus
providing an alternative to the derivation described in Section 4.1, based on the
convolution formula.
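The transform argument avoids the convolution formula, but the two routes can be compared numerically. As an illustration of the Poisson case (Example 4.32), the sketch below convolves two Poisson PMFs directly and compares the result with the Poisson PMF of mean λ + μ. The parameter values and function names are assumptions for the example.

```python
import math

def poisson_pmf(mean, k):
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

lam, mu = 1.5, 2.5   # assumed means of X and Y

def pmf_of_sum(k):
    """P(X + Y = k) by discrete convolution of the two Poisson PMFs."""
    return sum(poisson_pmf(lam, i) * poisson_pmf(mu, k - i) for i in range(k + 1))

# Largest discrepancy over the first 30 values of k.
max_gap = max(abs(pmf_of_sum(k) - poisson_pmf(lam + mu, k)) for k in range(30))
print(max_gap)
```

The printed gap should be at the level of floating-point rounding error.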


Summary of Transforms and their Properties 


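• The transform associated with a random variable X is given by

\[
M_X(s) = E[e^{sX}] =
\begin{cases}
\displaystyle \sum_{x} e^{sx} p_X(x), & X \text{ discrete}, \\[2ex]
\displaystyle \int_{-\infty}^{\infty} e^{sx} f_X(x)\, dx, & X \text{ continuous}.
\end{cases}
\]

• The distribution of a random variable is completely determined by the
corresponding transform.

• Moment generating properties:

\[
M_X(0) = 1, \qquad \frac{d}{ds} M_X(s) \Big|_{s=0} = E[X], \qquad \frac{d^n}{ds^n} M_X(s) \Big|_{s=0} = E[X^n].
\]

• If Y = aX + b, then $M_Y(s) = e^{sb} M_X(as)$.

• If X and Y are independent, then $M_{X+Y}(s) = M_X(s)\, M_Y(s)$.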





We have obtained formulas for the transforms associated with a few common
random variables. We can derive such formulas with a moderate amount
of algebra for many other distributions (see the end-of-chapter problems for the
case of the uniform distribution). We summarize the most useful ones in the
tables that follow.


Transforms for Common Discrete Random Variables


Bernoulli(p) (k = 0, 1)

\[
p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0, \end{cases}
\qquad M_X(s) = 1 - p + p e^{s}.
\]

Binomial(n, p) (k = 0, 1, ..., n)

\[
p_X(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad M_X(s) = (1 - p + p e^{s})^n.
\]

Geometric(p) (k = 1, 2, ...)

\[
p_X(k) = p(1 - p)^{k-1}, \qquad M_X(s) = \frac{p e^{s}}{1 - (1 - p)e^{s}}.
\]

Poisson(λ) (k = 0, 1, ...)

\[
p_X(k) = \frac{e^{-\lambda} \lambda^k}{k!}, \qquad M_X(s) = e^{\lambda(e^s - 1)}.
\]

Uniform(a, b) (k = a, a + 1, ..., b)

\[
p_X(k) = \frac{1}{b - a + 1}, \qquad M_X(s) = \frac{e^{sa}\big(e^{s(b - a + 1)} - 1\big)}{(b - a + 1)(e^{s} - 1)}.
\]


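Transforms for Common Continuous Random Variables


Uniform(a, b) (a ≤ x ≤ b)

\[
f_X(x) = \frac{1}{b - a}, \qquad M_X(s) = \frac{e^{sb} - e^{sa}}{s(b - a)}.
\]

Exponential(λ) (x ≥ 0)

\[
f_X(x) = \lambda e^{-\lambda x}, \qquad M_X(s) = \frac{\lambda}{\lambda - s} \quad (s < \lambda).
\]

Normal(μ, σ²) (-∞ < x < ∞)

\[
f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x - \mu)^2 / 2\sigma^2}, \qquad M_X(s) = e^{(\sigma^2 s^2/2) + \mu s}.
\]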


Transforms Associated with Joint Distributions


If two random variables X and Y are described by some joint distribution (e.g., a
joint PDF), then each one is associated with a transform $M_X(s)$ or $M_Y(s)$. These
are the transforms of the marginal distributions and do not convey information on
the dependence between the two random variables. Such information is contained
in a multivariate transform, which we now define.

Consider n random variables $X_1, \ldots, X_n$ related to the same experiment.
Let $s_1, \ldots, s_n$ be scalar free parameters. The associated multivariate transform
is a function of these n parameters and is defined by

\[
M_{X_1, \ldots, X_n}(s_1, \ldots, s_n) = E\big[e^{s_1 X_1 + \cdots + s_n X_n}\big].
\]

The inversion property of transforms discussed earlier extends to the multivariate
case. In particular, if $Y_1, \ldots, Y_n$ is another set of random variables and if
$M_{X_1, \ldots, X_n}(s_1, \ldots, s_n) = M_{Y_1, \ldots, Y_n}(s_1, \ldots, s_n)$ for all $(s_1, \ldots, s_n)$ belonging to
some n-dimensional cube with positive volume, then the joint distribution of
$X_1, \ldots, X_n$ is the same as the joint distribution of $Y_1, \ldots, Y_n$.


4.5 SUM OF A RANDOM NUMBER OF INDEPENDENT RANDOM VARIABLES


In our discussion so far of sums of random variables, we have always assumed
that the number of variables in the sum is known and fixed. In this section, we
will consider the case where the number of random variables being added is itself
random. In particular, we consider the sum

\[
Y = X_1 + \cdots + X_N,
\]

where N is a random variable that takes nonnegative integer values, and $X_1, X_2, \ldots$
are identically distributed random variables. (If N = 0, we let Y = 0.) We assume
that $N, X_1, X_2, \ldots$ are independent, meaning that any finite subcollection
of these random variables are independent.

Let us denote by E[X] and var(X) the common mean and variance, respectively,
of the $X_i$. We wish to derive formulas for the mean, variance, and
the transform of Y. The method that we follow is to first condition on the event
{N = n}, which brings us to the more familiar case of a fixed number of random
variables.

Fix a nonnegative integer n. The random variable $X_1 + \cdots + X_n$ is independent
of N and, therefore, independent of {N = n}. Hence,

\[
\begin{aligned}
E[Y \mid N = n] &= E[X_1 + \cdots + X_N \mid N = n] \\
&= E[X_1 + \cdots + X_n \mid N = n] \\
&= E[X_1 + \cdots + X_n] \\
&= n E[X].
\end{aligned}
\]


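This is true for every nonnegative integer n, so

\[
E[Y \mid N] = N E[X].
\]

Using the law of iterated expectations, we obtain

\[
E[Y] = E\big[E[Y \mid N]\big] = E\big[N E[X]\big] = E[N]\, E[X].
\]

Similarly,

\[
\begin{aligned}
\mathrm{var}(Y \mid N = n) &= \mathrm{var}(X_1 + \cdots + X_N \mid N = n) \\
&= \mathrm{var}(X_1 + \cdots + X_n) \\
&= n\, \mathrm{var}(X).
\end{aligned}
\]

Since this is true for every nonnegative integer n, the random variable var(Y | N)
is equal to N var(X). We now use the law of total variance to obtain

\[
\begin{aligned}
\mathrm{var}(Y) &= E[\mathrm{var}(Y \mid N)] + \mathrm{var}(E[Y \mid N]) \\
&= E[N\, \mathrm{var}(X)] + \mathrm{var}(N E[X]) \\
&= E[N]\, \mathrm{var}(X) + (E[X])^2\, \mathrm{var}(N).
\end{aligned}
\]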


The calculation of the transform proceeds along similar lines. The transform
associated with Y, conditional on N = n, is $E[e^{sY} \mid N = n]$. However, conditioned
on N = n, Y is the sum of the independent random variables $X_1, \ldots, X_n$,
and

\[
\begin{aligned}
E[e^{sY} \mid N = n] &= E[e^{sX_1} \cdots e^{sX_N} \mid N = n] \\
&= E[e^{sX_1} \cdots e^{sX_n}] \\
&= E[e^{sX_1}] \cdots E[e^{sX_n}] \\
&= (M_X(s))^n,
\end{aligned}
\]

where $M_X(s)$ is the transform associated with $X_i$, for each i. Using the law of
iterated expectations, the (unconditional) transform associated with Y is

\[
M_Y(s) = E[e^{sY}] = E\big[E[e^{sY} \mid N]\big] = E\big[(M_X(s))^N\big] = \sum_{n=0}^{\infty} (M_X(s))^n\, p_N(n).
\]

Using the observation

\[
(M_X(s))^n = e^{\log (M_X(s))^n} = e^{n \log M_X(s)},
\]

we have

\[
M_Y(s) = \sum_{n=0}^{\infty} e^{n \log M_X(s)}\, p_N(n).
\]

Comparing with the formula

\[
M_N(s) = E[e^{sN}] = \sum_{n=0}^{\infty} e^{sn}\, p_N(n),
\]

we see that $M_Y(s) = M_N(\log M_X(s))$, i.e., $M_Y(s)$ is obtained from the formula
for $M_N(s)$, with s replaced by $\log M_X(s)$ or, equivalently, with $e^s$ replaced by
$M_X(s)$.

Let us summarize the properties derived so far. 


Properties of the Sum of a Random Number of Independent Random
Variables


Let $X_1, X_2, \ldots$ be identically distributed random variables with mean E[X]
and variance var(X). Let N be a random variable that takes nonnegative integer
values. We assume that all of these random variables are independent,
and we consider the sum

\[
Y = X_1 + \cdots + X_N.
\]

Then:

• E[Y] = E[N] E[X].

• var(Y) = E[N] var(X) + (E[X])² var(N).

• We have

\[
M_Y(s) = M_N(\log M_X(s)).
\]

Equivalently, the transform $M_Y(s)$ is found by starting with the transform
$M_N(s)$ and replacing each occurrence of $e^s$ with $M_X(s)$.





Example 4.34. A remote village has three gas stations. Each gas station is open
on any given day with probability 1/2, independent of the others. The amount of
gas available in each gas station is unknown and is uniformly distributed between 0
and 1000 gallons. We wish to characterize the probability law of the total amount
of gas available at the gas stations that are open.

The number N of open gas stations is a binomial random variable with p =
1/2 and the corresponding transform is

\[
M_N(s) = (1 - p + p e^{s})^3 = \frac{1}{8} (1 + e^{s})^3.
\]

The transform $M_X(s)$ associated with the amount of gas available in an open gas
station is

\[
M_X(s) = \frac{e^{1000s} - 1}{1000 s}.
\]

The transform associated with the total amount Y available is the same as $M_N(s)$,
except that each occurrence of $e^s$ is replaced with $M_X(s)$, i.e.,

\[
M_Y(s) = \frac{1}{8} \left( 1 + \frac{e^{1000s} - 1}{1000 s} \right)^3.
\]
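The mean and variance formulas for a random sum can be tried out on the gas station setting of Example 4.34. The simulation below is a rough sketch rather than part of the example; the seed and the number of trials are arbitrary, and the mean and variance of the binomial and uniform distributions are the standard ones.

```python
import random

random.seed(1)
trials = 200_000

totals = []
for _ in range(trials):
    # N ~ binomial(3, 1/2): number of open stations.
    n_open = sum(random.random() < 0.5 for _ in range(3))
    # Y = X_1 + ... + X_N, each X_i uniform on (0, 1000).
    totals.append(sum(random.uniform(0, 1000) for _ in range(n_open)))

mean_y = sum(totals) / trials
var_y = sum((y - mean_y) ** 2 for y in totals) / trials

mean_n, var_n = 3 * 0.5, 3 * 0.5 * 0.5       # binomial mean and variance
mean_x, var_x = 500.0, 1000.0 ** 2 / 12      # uniform(0, 1000) mean and variance

formula_mean = mean_n * mean_x                       # E[N] E[X]
formula_var = mean_n * var_x + mean_x ** 2 * var_n   # E[N] var(X) + (E[X])^2 var(N)

print(mean_y, formula_mean)
print(var_y, formula_var)
```

The sample mean and variance should come out close to the formula values, with some sampling noise.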


Example 4.35. Sum of a Geometric Number of Independent Exponential
Random Variables. Jane visits a number of bookstores, looking for Great
Expectations. Any given bookstore carries the book with probability p, independent
of the others. In a typical bookstore visited, Jane spends a random amount
of time, exponentially distributed with parameter λ, until she either finds the book
or she determines that the bookstore does not carry it. We assume that Jane will
keep visiting bookstores until she buys the book and that the time spent in each is
independent of everything else. We wish to find the mean, variance, and PDF of
the total time spent in bookstores.

The total number N of bookstores visited is geometrically distributed with parameter
p. Hence, the total time Y spent in bookstores is the sum of a geometrically
distributed number N of independent exponential random variables $X_1, X_2, \ldots$. We
have

\[
E[Y] = E[N]\, E[X] = \frac{1}{p} \cdot \frac{1}{\lambda}.
\]

Using the formulas for the variance of geometric and exponential random variables,
we also obtain

\[
\mathrm{var}(Y) = E[N]\, \mathrm{var}(X) + (E[X])^2\, \mathrm{var}(N)
= \frac{1}{p} \cdot \frac{1}{\lambda^2} + \frac{1}{\lambda^2} \cdot \frac{1 - p}{p^2}
= \frac{1}{\lambda^2 p^2}.
\]

In order to find the transform $M_Y(s)$, let us recall that

\[
M_X(s) = \frac{\lambda}{\lambda - s}, \qquad M_N(s) = \frac{p e^{s}}{1 - (1 - p)e^{s}}.
\]

Then, $M_Y(s)$ is found by starting with $M_N(s)$ and replacing each occurrence of $e^s$
with $M_X(s)$. This yields

\[
M_Y(s) = \frac{p M_X(s)}{1 - (1 - p) M_X(s)} = \frac{\dfrac{p\lambda}{\lambda - s}}{1 - (1 - p)\dfrac{\lambda}{\lambda - s}},
\]

which simplifies to

\[
M_Y(s) = \frac{p\lambda}{p\lambda - s}.
\]

We recognize this as the transform associated with an exponentially distributed
random variable with parameter pλ, and therefore,

\[
f_Y(y) = p\lambda\, e^{-p\lambda y}, \qquad y \ge 0.
\]

This result can be surprising because the sum of a fixed number n of independent
exponential random variables is not exponentially distributed. For example,
if n = 2, the transform associated with the sum is $(\lambda/(\lambda - s))^2$, which does not
correspond to an exponential distribution.
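The conclusion of Example 4.35 can be probed by simulation: if Y is exponential with parameter pλ, its mean should be 1/(pλ) and its variance 1/(pλ)². The sketch below simulates the bookstore visits; the values of p and λ, the seed, and the function name are assumptions for the illustration.

```python
import random

random.seed(2)
p, lam = 0.25, 2.0   # assumed parameter values
trials = 200_000

def total_time():
    """Sum exponential(lam) visit times until the first store that carries the book."""
    t = random.expovariate(lam)
    while random.random() >= p:      # book not found: visit another store
        t += random.expovariate(lam)
    return t

samples = [total_time() for _ in range(trials)]
mean_y = sum(samples) / trials
var_y = sum((y - mean_y) ** 2 for y in samples) / trials

print(mean_y, 1 / (p * lam))          # sample mean vs. 1/(p*lam)
print(var_y, 1 / (p * lam) ** 2)      # sample variance vs. 1/(p*lam)^2
```

Matching the first two moments does not by itself prove that Y is exponential; that conclusion rests on the transform argument in the example.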


Example 4.36. Sum of a Geometric Number of Independent Geometric
Random Variables. This example is a discrete counterpart of the preceding one.
We let N be geometrically distributed with parameter p. We also let each random
variable $X_i$ be geometrically distributed with parameter q. We assume that all of
these random variables are independent. Let $Y = X_1 + \cdots + X_N$. We have

\[
M_N(s) = \frac{p e^{s}}{1 - (1 - p)e^{s}}, \qquad M_X(s) = \frac{q e^{s}}{1 - (1 - q)e^{s}}.
\]

To determine $M_Y(s)$, we start with the formula for $M_N(s)$ and replace each occurrence
of $e^s$ with $M_X(s)$. This yields

\[
M_Y(s) = \frac{p M_X(s)}{1 - (1 - p) M_X(s)},
\]

and, after some algebra,

\[
M_Y(s) = \frac{pq e^{s}}{1 - (1 - pq)e^{s}}.
\]

We conclude that Y is geometrically distributed, with parameter pq.
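The "some algebra" step in Example 4.36 can be spot-checked numerically by evaluating both sides at a chosen value of s. The sketch below does this for one assumed choice of p, q, and s; the helper `geometric_transform` is a name introduced only for this check.

```python
import math

def geometric_transform(p, z):
    """Geometric(p) transform with e^s replaced by z: p z / (1 - (1 - p) z)."""
    return p * z / (1 - (1 - p) * z)

p, q, s = 0.4, 0.5, -0.3   # assumed values; s < 0 keeps every series convergent

m_x = geometric_transform(q, math.exp(s))            # M_X(s)
m_y = geometric_transform(p, m_x)                    # M_N with e^s replaced by M_X(s)
claimed = geometric_transform(p * q, math.exp(s))    # geometric(pq) transform at s

print(m_y, claimed)
```

Agreement at a single point is only a sanity check on the algebra, not a proof; the identity itself holds for all s where the transforms are finite.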


4.6 SUMMARY AND DISCUSSION 


In this chapter, we have studied a number of advanced topics. We discuss here 
some of the highlights. 

In Section 4.1, we addressed the problem of calculating the PDF of a func- 
tion g(X) of a continuous random variable X. The concept of a CDF is very 
useful here. In particular, the PDF of g(X) is typically obtained by calculat- 
ing and differentiating the corresponding CDF. In some cases, such as when the 
function g is strictly monotonic, the calculation is facilitated through the use of 
special formulas. We also considered some examples involving a function g( X, Y ) 
of two continuous random variables. In particular, we derived the convolution 
formula for the probability law of the sum of two independent random variables. 

In Section 4.2, we introduced covariance and correlation, both of which 
are important qualitative indicators of the relationship between two random 
variables. The covariance and its scaled version, the correlation coefficient, are 
involved in determining the variance of the sum of dependent random variables. 
They also play an important role in the linear least mean squares estimation
methodology of Section 8.4.

In Section 4.3, we reconsidered the subject of conditioning, with the aim
of developing tools for computing expected values and variances. We took a 
closer look at the conditional expectation and indicated that it can be viewed 
as a random variable, with an expectation and variance of its own. We derived 
some related properties, including the law of iterated expectations, and the law 
of total variance. 

In Section 4.4, we introduced the transform associated with a random vari- 
able, and saw how such a transform can be computed. Conversely, we indicated 
that given a transform, the distribution of an associated random variable is 
uniquely determined. It can be found, for example, using tables of commonly 
occurring transforms. We have found transforms useful for a variety of purposes, 
such as the following. 


(a) Knowledge of the transform associated with a random variable provides a 
shortcut for calculating the moments of the random variable. 


(b) The transform associated with the sum of two independent random vari- 
ables is equal to the product of the transforms associated with each one of 
them. This property was used to show that the sum of two independent 
normal (respectively, Poisson) random variables is normal (respectively, 
Poisson). 


(c) Transforms can be used to characterize the distribution of the sum of a 
random number of random variables (Section 4.5), something which is often 
impossible by other means. 


Finally, in Section 4.5, we derived formulas for the mean, the variance, and 
the transform of the sum of a random number of random variables, by combining 
the methodology of Sections 4.3 and 4.4. 

PROBLEMS 


SECTION 4.1. Derived Distributions 


Problem 1. If X is a random variable that is uniformly distributed between $-1$ and
1, find the PDF of $\sqrt{|X|}$ and the PDF of $-\ln |X|$.


Problem 2. Find the PDF of $e^X$ in terms of the PDF of X. Specialize the answer
to the case where X is uniformly distributed between 0 and 1.


Problem 3. Find the PDFs of $|X|^{1/3}$ and $|X|^{1/4}$ in terms of the PDF of X.


Problem 4. The metro train arrives at the station near your home every quarter 
hour starting at 6:00 a.m. You walk into the station every morning between 7:10 and 
7:30 a.m., with the time in this interval being a random variable with given PDF (cf. 
Example 3.14, in Chapter 3). Let X be the elapsed time, in minutes, between 7:10 and 
the time of your arrival. Let Y be the time that you have to wait until you board a 
train. Calculate the CDF of Y in terms of the CDF of X and differentiate to obtain a 
formula for the PDF of Y. 


Problem 5. Let X and Y be independent random variables, uniformly distributed 
in the interval [0,1]. Find the CDF and the PDF of |X — Y |. 


Problem 6. Let X and Y be the Cartesian coordinates of a randomly chosen point 
(according to a uniform PDF) in the triangle with vertices at (0, 1), (0, —1), and (1,0). 
Find the CDF and the PDF of |X — Y |. 


Problem 7. Two points are chosen randomly and independently from the interval
[0, 1] according to a uniform distribution. Show that the expected distance between
the two points is 1/3.


Problem 8. Find the PDF of Z = X + Y, when X and Y are independent exponential
random variables with common parameter $\lambda$.


Problem 9. Consider the same problem as in Example 4.9, but assume that the
random variables X and Y are independent and exponentially distributed with different
parameters $\lambda$ and $\mu$, respectively. Find the PDF of X - Y.


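Problem 10. Let X and Y be independent random variables with PMFs

$$p_X(x) = \begin{cases} 1/3, & \text{if } x = 1, 2, 3, \\ 0, & \text{otherwise,} \end{cases}
\qquad
p_Y(y) = \begin{cases} 1/2, & \text{if } y = 0, \\ 1/3, & \text{if } y = 1, \\ 1/6, & \text{if } y = 2, \\ 0, & \text{otherwise.} \end{cases}$$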


Find the PMF of Z = X + Y, using the convolution formula. 
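
For checking a hand calculation, the discrete convolution formula can be sketched in a few lines of Python. The function name `convolve_pmfs` and the dictionary representation of a PMF are assumptions made for illustration. Exact fractions are used so that no rounding is involved.

```python
from collections import defaultdict
from fractions import Fraction

def convolve_pmfs(p_x, p_y):
    """PMF of Z = X + Y for independent X and Y.

    Each PMF is a dict mapping a value to its probability; the result
    accumulates p_X(x) * p_Y(y) over all pairs with x + y = z.
    """
    p_z = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for x, px in p_x.items():
        for y, py in p_y.items():
            p_z[x + y] += px * py
    return dict(p_z)

# The PMFs of Problem 10.
p_x = {1: Fraction(1, 3), 2: Fraction(1, 3), 3: Fraction(1, 3)}
p_y = {0: Fraction(1, 2), 1: Fraction(1, 3), 2: Fraction(1, 6)}
p_z = convolve_pmfs(p_x, p_y)
for z in sorted(p_z):
    print(z, p_z[z])
```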

Problem 11. Use the convolution formula to establish that the sum of two independent
Poisson random variables with parameters $\lambda$ and $\mu$, respectively, is Poisson with
parameter $\lambda + \mu$.


Problem 12. The random variables X, Y, and Z are independent and uniformly
distributed between zero and one. Find the PDF of X + Y + Z.


Problem 13. Consider a PDF that is positive only within an interval [a,b] and is 
symmetric around the mean (a + b)/2. Let X and Y be independent random variables 
that both have this PDF. Suppose that you have calculated the PDF of X + Y. How 
can you easily obtain the PDF of X — Y? 


Problem 14. Competing exponentials. The lifetimes of two light bulbs are
modeled as independent and exponential random variables X and Y, with parameters
$\lambda$ and $\mu$, respectively. The time at which a light bulb first burns out is

$$Z = \min\{X, Y\}.$$

Show that Z is an exponential random variable with parameter $\lambda + \mu$.


Problem 15.* Cauchy random variable. 


(a) Let X be a random variable that is uniformly distributed between $-1/2$ and $1/2$.
Show that the PDF of $Y = \tan(\pi X)$ is

$$f_Y(y) = \frac{1}{\pi(1 + y^2)}, \qquad -\infty < y < \infty.$$

(Y is called a Cauchy random variable.)


(b) Let Y be a Cauchy random variable. Find the PDF of the random variable X,
which is equal to the angle between $-\pi/2$ and $\pi/2$ whose tangent is Y.


Solution. (a) We first note that Y is a continuous, strictly monotonically increasing
function of X, which takes values between $-\infty$ and $\infty$, as X ranges over the interval
$[-1/2, 1/2]$. Therefore, we have for all scalars y,

$$F_Y(y) = P(Y \le y) = P\big(\tan(\pi X) \le y\big) = P(\pi X \le \tan^{-1} y) = \frac{1}{\pi}\tan^{-1} y + \frac{1}{2},$$

where the last equality follows using the CDF of X, which is uniformly distributed in the
interval $[-1/2, 1/2]$. Therefore, by differentiation, using the formula
$d/dy(\tan^{-1} y) = 1/(1 + y^2)$, we have for all y,

$$f_Y(y) = \frac{1}{\pi(1 + y^2)}.$$


(b) We first compute the CDF of X and then differentiate to obtain its PDF. We have
for $-\pi/2 \le x \le \pi/2$,

$$\begin{aligned}
P(X \le x) &= P(\tan^{-1} Y \le x) \\
&= P(Y \le \tan x) \\
&= \frac{1}{\pi}\int_{-\infty}^{\tan x} \frac{1}{1 + y^2}\, dy \\
&= \frac{1}{\pi}\tan^{-1} y \,\Big|_{-\infty}^{\tan x} \\
&= \frac{1}{\pi}\Big(x + \frac{\pi}{2}\Big).
\end{aligned}$$

For $x < -\pi/2$, we have $P(X \le x) = 0$, and for $\pi/2 < x$, we have $P(X \le x) = 1$.
Taking the derivative of the CDF $P(X \le x)$, we find that X is uniformly distributed
on the interval $[-\pi/2, \pi/2]$.


Note: An interesting property of the Cauchy random variable is that it satisfies 


$$\int_0^{\infty} \frac{y}{\pi(1 + y^2)}\, dy = -\int_{-\infty}^{0} \frac{y}{\pi(1 + y^2)}\, dy = \infty,$$

as can be easily verified. As a result, the Cauchy random variable does not have a
well-defined expected value, despite the symmetry of its PDF around 0; see the footnote in
Section 3.1 on the definition of the expected value of a continuous random variable.


Problem 16.* The polar coordinates of two independent normal random
variables. Let X and Y be independent standard normal random variables. The pair
(X, Y) can be described in polar coordinates in terms of random variables $R \ge 0$ and
$\Theta \in [0, 2\pi]$, so that

$$X = R\cos\Theta, \qquad Y = R\sin\Theta.$$

(a) Show that $\Theta$ is uniformly distributed in $[0, 2\pi]$, that R has the PDF

$$f_R(r) = r e^{-r^2/2}, \qquad r \ge 0,$$

and that R and $\Theta$ are independent. (The random variable R is said to have a
Rayleigh distribution.)


(b) Show that $R^2$ has an exponential distribution with parameter 1/2.


Note: Using the results in this problem, we see that samples of a normal random vari- 
able can be generated using samples of independent uniform and exponential random 
variables. 


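Solution. (a) The joint PDF of X and Y is

$$f_{X,Y}(x, y) = f_X(x) f_Y(y) = \frac{1}{2\pi} e^{-(x^2 + y^2)/2}.$$

We first find the joint CDF of R and $\Theta$. Fix some $r > 0$ and some $\theta \in [0, 2\pi]$, and
let A be the set of points (x, y) whose polar coordinates $(\bar r, \bar\theta)$ satisfy $0 \le \bar r \le r$ and
$0 \le \bar\theta \le \theta$; note that the set A is a sector of a circle of radius r, with angle $\theta$. We have

$$\begin{aligned}
F_{R,\Theta}(r, \theta) &= P(R \le r,\ \Theta \le \theta) = P\big((X, Y) \in A\big) \\
&= \frac{1}{2\pi} \iint_{(x,y) \in A} e^{-(x^2 + y^2)/2}\, dx\, dy = \frac{1}{2\pi} \int_0^{\theta} \int_0^{r} e^{-\bar r^2/2}\, \bar r\, d\bar r\, d\bar\theta,
\end{aligned}$$

where the last equality is obtained by transforming to polar coordinates. We then
differentiate, to find that

$$f_{R,\Theta}(r, \theta) = \frac{\partial^2 F_{R,\Theta}}{\partial r\, \partial\theta}(r, \theta) = \frac{r}{2\pi} e^{-r^2/2}, \qquad r \ge 0,\ \theta \in [0, 2\pi].$$

Thus,

$$f_R(r) = \int_0^{2\pi} f_{R,\Theta}(r, \theta)\, d\theta = r e^{-r^2/2}, \qquad r \ge 0.$$

Furthermore,

$$f_{\Theta|R}(\theta \mid r) = \frac{f_{R,\Theta}(r, \theta)}{f_R(r)} = \frac{1}{2\pi}, \qquad \theta \in [0, 2\pi].$$

Since the conditional PDF $f_{\Theta|R}$ of $\Theta$ is unaffected by the value of the conditioning
variable R, it follows that it is also equal to the unconditional PDF $f_\Theta$. In particular,
$f_{R,\Theta}(r, \theta) = f_R(r) f_\Theta(\theta)$, so that R and $\Theta$ are independent.


(b) Let $t \ge 0$. We have

$$P(R^2 \ge t) = P(R \ge \sqrt{t}) = \int_{\sqrt{t}}^{\infty} r e^{-r^2/2}\, dr = \int_{t/2}^{\infty} e^{-u}\, du = e^{-t/2},$$

where we have used the change of variables $u = r^2/2$. By differentiating, we obtain

$$f_{R^2}(t) = \frac{1}{2} e^{-t/2}, \qquad t \ge 0.$$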
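
The note to this problem says that normal samples can be generated from independent uniform and exponential samples. A minimal Python sketch of that recipe follows, assuming the standard library generator `random.Random`. The function name, the seed, and the sample size are arbitrary illustrative choices.

```python
import math
import random

def standard_normal_pair(rng):
    """Return two independent standard normal samples (x, y).

    Theta is uniform on [0, 2*pi]; R^2 is exponential with parameter 1/2.
    """
    theta = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(rng.expovariate(0.5))  # expovariate takes the rate, here 1/2
    return r * math.cos(theta), r * math.sin(theta)

rng = random.Random(0)  # fixed seed (an arbitrary choice) for repeatability
xs = [standard_normal_pair(rng)[0] for _ in range(20000)]
sample_mean = sum(xs) / len(xs)
sample_var = sum((x - sample_mean) ** 2 for x in xs) / len(xs)
print(sample_mean, sample_var)  # should be close to 0 and 1
```

The sample mean and variance only approximate 0 and 1; how close they come depends on the number of samples.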


SECTION 4.2. Covariance and Correlation 


Problem 17. Suppose that X and Y are random variables with the same variance. 
Show that X — Y and X + Y are uncorrelated. 


Problem 18. Consider four random variables, W, X, Y, Z, with

$$E[W] = E[X] = E[Y] = E[Z] = 0,$$

$$\mathrm{var}(W) = \mathrm{var}(X) = \mathrm{var}(Y) = \mathrm{var}(Z) = 1,$$

and assume that W, X, Y, Z are pairwise uncorrelated. Find the correlation coefficients
$\rho(R, S)$ and $\rho(R, T)$, where R = W + X, S = X + Y, and T = Y + Z.


Problem 19. Suppose that a random variable X satisfies

$$E[X] = 0, \qquad E[X^2] = 1, \qquad E[X^3] = 0, \qquad E[X^4] = 3,$$

and let

$$Y = a + bX + cX^2.$$

Find the correlation coefficient $\rho(X, Y)$.

Problem 20.* Schwarz inequality. Show that for any random variables X and Y,
we have

$$\big(E[XY]\big)^2 \le E[X^2]\, E[Y^2].$$


Solution. We may assume that $E[Y^2] \ne 0$; otherwise, we have Y = 0 with probability
1, and hence E[XY] = 0, so the inequality holds. We have

$$\begin{aligned}
0 &\le E\left[\left(X - \frac{E[XY]}{E[Y^2]}\, Y\right)^2\right] \\
&= E\left[X^2 - 2\,\frac{E[XY]}{E[Y^2]}\, XY + \frac{\big(E[XY]\big)^2}{\big(E[Y^2]\big)^2}\, Y^2\right] \\
&= E[X^2] - 2\,\frac{E[XY]}{E[Y^2]}\, E[XY] + \frac{\big(E[XY]\big)^2}{\big(E[Y^2]\big)^2}\, E[Y^2] \\
&= E[X^2] - \frac{\big(E[XY]\big)^2}{E[Y^2]},
\end{aligned}$$

i.e., $\big(E[XY]\big)^2 \le E[X^2]\, E[Y^2]$.


Problem 21.* Correlation coefficient. Consider the correlation coefficient

$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{var}(X)\,\mathrm{var}(Y)}}$$

of two random variables X and Y that have positive variances. Show that:

(a) $|\rho(X, Y)| \le 1$. Hint: Use the Schwarz inequality from the preceding problem.

(b) If Y - E[Y] is a positive (or negative) multiple of X - E[X], then $\rho(X, Y) = 1$
[or $\rho(X, Y) = -1$, respectively].

(c) If $\rho(X, Y) = 1$ [or $\rho(X, Y) = -1$], then, with probability 1, Y - E[Y] is a positive
(or negative, respectively) multiple of X - E[X].


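Solution. (a) Let $\tilde X = X - E[X]$ and $\tilde Y = Y - E[Y]$. Using the Schwarz inequality,
we get

$$\big(\rho(X, Y)\big)^2 = \frac{\big(E[\tilde X \tilde Y]\big)^2}{E[\tilde X^2]\, E[\tilde Y^2]} \le 1,$$

and hence $|\rho(X, Y)| \le 1$.

(b) If $\tilde Y = a\tilde X$, then

$$\rho(X, Y) = \frac{E[\tilde X\, a\tilde X]}{\sqrt{E[\tilde X^2]\, E[(a\tilde X)^2]}} = \frac{a}{|a|}.$$

(c) If $\big(\rho(X, Y)\big)^2 = 1$, the calculation in the solution of Problem 20 yields

$$\begin{aligned}
E\left[\left(\tilde X - \frac{E[\tilde X \tilde Y]}{E[\tilde Y^2]}\, \tilde Y\right)^2\right]
&= E[\tilde X^2] - \frac{\big(E[\tilde X \tilde Y]\big)^2}{E[\tilde Y^2]} \\
&= E[\tilde X^2]\Big(1 - \big(\rho(X, Y)\big)^2\Big) \\
&= 0.
\end{aligned}$$

Thus, with probability 1, the random variable

$$\tilde X - \frac{E[\tilde X \tilde Y]}{E[\tilde Y^2]}\, \tilde Y$$

is equal to zero. It follows that, with probability 1,

$$\tilde X = \frac{E[\tilde X \tilde Y]}{E[\tilde Y^2]}\, \tilde Y = \sqrt{\frac{E[\tilde X^2]}{E[\tilde Y^2]}}\; \rho(X, Y)\, \tilde Y,$$

i.e., the sign of the constant ratio of $\tilde X$ and $\tilde Y$ is determined by the sign of $\rho(X, Y)$.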


SECTION 4.3. Conditional Expectation and Variance Revisited 


Problem 22. Consider a gambler who at each gamble either wins or loses his bet with
probabilities p and 1 - p, independent of earlier gambles. When p > 1/2, a popular
gambling system, known as the Kelly strategy, is to always bet the fraction 2p - 1 of
the current fortune. Compute the expected fortune after n gambles, starting with x
units and employing the Kelly strategy.


Problem 23. Pat and Nat are dating, and all of their dates are scheduled to start at 
9 p.m. Nat always arrives promptly at 9 p.m. Pat is highly disorganized and arrives at 
a time that is uniformly distributed between 8 p.m. and 10 p.m. Let X be the time in 
hours between 8 p.m. and the time when Pat arrives. If Pat arrives before 9 p.m., their 
date will last exactly 3 hours. If Pat arrives after 9 p.m., their date will last for a time 
that is uniformly distributed between 0 and 3 — X hours. The date starts at the time 
they meet. Nat gets irritated when Pat is late and will end the relationship after the 
second date on which Pat is late by more than 45 minutes. All dates are independent 
of any other dates. 


(a) What is the expected number of hours Nat waits for Pat to arrive? 

(b) What is the expected duration of any particular date? 

(c) What is the expected number of dates they will have before breaking up? 


Problem 24. A retired professor comes to the office at a time which is uniformly
distributed between 9 a.m. and 1 p.m., performs a single task, and leaves when the task
is completed. The duration of the task is exponentially distributed with parameter
$\lambda(y) = 1/(5 - y)$, where y is the length of the time interval between 9 a.m. and the
time of his arrival.


(a) What is the expected amount of time that the professor devotes to the task?
(b) What is the expected time at which the task is completed? 


(c) The professor has a Ph.D. student who on a given day comes to see him at a 
time that is uniformly distributed between 9 a.m. and 5 p.m. If the student does 
not find the professor, he leaves and does not return. If he finds the professor, he 
spends an amount of time that is uniformly distributed between 0 and 1 hour. 
The professor will spend the same total amount of time on his task regardless of 
whether he is interrupted by the student. What is the expected amount of time 
that the professor will spend with the student and what is the expected time at 
which he will leave his office? 


Problem 25.* Show that for a discrete or continuous random variable X, and any 
function g(Y ) of another random variable Y, we have E[Xg(Y) | Y] = g(Y) E[X | Y]. 


Solution. Assume that X is continuous. From a version of the expected value rule for 
conditional expectations given in Chapter 3, we have 


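$$\begin{aligned}
E[Xg(Y) \mid Y = y] &= \int_{-\infty}^{\infty} x\, g(y)\, f_{X|Y}(x \mid y)\, dx \\
&= g(y) \int_{-\infty}^{\infty} x\, f_{X|Y}(x \mid y)\, dx \\
&= g(y)\, E[X \mid Y = y].
\end{aligned}$$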


This shows that the realized values E[Xg(Y)|Y = y] and g(y)E[X|Y = y] of the 
random variables E[Xg(Y )| Y] and g(Y)E[X | Y] are always equal. Hence these two 
random variables are equal. The proof is similar if X is discrete. 


Problem 26.* Let X and Y be independent random variables. Use the law of total 
variance to show that 


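$$\mathrm{var}(XY) = \big(E[X]\big)^2 \mathrm{var}(Y) + \big(E[Y]\big)^2 \mathrm{var}(X) + \mathrm{var}(X)\,\mathrm{var}(Y).$$


Solution. Let Z = XY. The law of total variance yields

$$\mathrm{var}(Z) = \mathrm{var}\big(E[Z \mid X]\big) + E\big[\mathrm{var}(Z \mid X)\big].$$

We have

$$E[Z \mid X] = E[XY \mid X] = X E[Y],$$

so that

$$\mathrm{var}\big(E[Z \mid X]\big) = \mathrm{var}\big(X E[Y]\big) = \big(E[Y]\big)^2 \mathrm{var}(X).$$

Furthermore,

$$\mathrm{var}(Z \mid X) = \mathrm{var}(XY \mid X) = X^2\, \mathrm{var}(Y \mid X) = X^2\, \mathrm{var}(Y),$$

so that

$$E\big[\mathrm{var}(Z \mid X)\big] = E[X^2]\, \mathrm{var}(Y) = \big(E[X]\big)^2 \mathrm{var}(Y) + \mathrm{var}(X)\,\mathrm{var}(Y).$$

Combining the preceding relations, we obtain

$$\mathrm{var}(XY) = \big(E[X]\big)^2 \mathrm{var}(Y) + \big(E[Y]\big)^2 \mathrm{var}(X) + \mathrm{var}(X)\,\mathrm{var}(Y).$$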
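
The product formula can be verified exactly on a small discrete case. The Python sketch below builds the PMF of XY for two independent discrete random variables and compares its variance with the right-hand side. The two PMFs and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction
from itertools import product

def mean(pmf):
    """Expected value of a PMF given as {value: probability}."""
    return sum(v * p for v, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

def product_pmf(p_x, p_y):
    """PMF of XY when X and Y are independent."""
    p_z = {}
    for (x, px), (y, py) in product(p_x.items(), p_y.items()):
        p_z[x * y] = p_z.get(x * y, 0) + px * py
    return p_z

# Hypothetical PMFs, chosen only for illustration.
p_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}
p_y = {-1: Fraction(1, 3), 2: Fraction(2, 3)}

lhs = variance(product_pmf(p_x, p_y))
rhs = (mean(p_x) ** 2 * variance(p_y)
       + mean(p_y) ** 2 * variance(p_x)
       + variance(p_x) * variance(p_y))
print(lhs == rhs)
```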


Problem 27.* We toss n times a biased coin whose probability of heads, denoted by
q, is the value of a random variable Q with given mean $\mu$ and positive variance $\sigma^2$. Let
$X_i$ be a Bernoulli random variable that models the outcome of the ith toss (i.e., $X_i = 1$
if the ith toss is a head). We assume that $X_1, \ldots, X_n$ are conditionally independent,
given Q = q. Let X be the number of heads obtained in the n tosses.


(a) Use the law of iterated expectations to find $E[X_i]$ and $E[X]$.
(b) Find $\mathrm{cov}(X_i, X_j)$. Are $X_1, \ldots, X_n$ independent?


(c) Use the law of total variance to find var(X). Verify your answer using the
covariance result of part (b).


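Solution. (a) We have, from the law of iterated expectations and the fact $E[X_i \mid Q] = Q$,

$$E[X_i] = E\big[E[X_i \mid Q]\big] = E[Q] = \mu.$$

Since $X = X_1 + \cdots + X_n$, it follows that

$$E[X] = E[X_1] + \cdots + E[X_n] = n\mu.$$


(b) We have, for $i \ne j$, using the conditional independence assumption,

$$E[X_i X_j \mid Q] = E[X_i \mid Q]\, E[X_j \mid Q] = Q^2,$$

and

$$E[X_i X_j] = E\big[E[X_i X_j \mid Q]\big] = E[Q^2].$$

Thus,

$$\mathrm{cov}(X_i, X_j) = E[X_i X_j] - E[X_i]\, E[X_j] = E[Q^2] - \mu^2 = \sigma^2.$$

Since $\mathrm{cov}(X_i, X_j) > 0$, $X_1, \ldots, X_n$ are not independent.
Also, for i = j, using the observation that $X_i^2 = X_i$,

$$\begin{aligned}
\mathrm{var}(X_i) &= E[X_i^2] - \big(E[X_i]\big)^2 \\
&= E[X_i] - \big(E[X_i]\big)^2 \\
&= \mu - \mu^2.
\end{aligned}$$

(c) Using the law of total variance, and the conditional independence of $X_1, \ldots, X_n$,
we have

$$\begin{aligned}
\mathrm{var}(X) &= E\big[\mathrm{var}(X \mid Q)\big] + \mathrm{var}\big(E[X \mid Q]\big) \\
&= E\big[\mathrm{var}(X_1 + \cdots + X_n \mid Q)\big] + \mathrm{var}\big(E[X_1 + \cdots + X_n \mid Q]\big) \\
&= E\big[nQ(1 - Q)\big] + \mathrm{var}(nQ) \\
&= nE[Q - Q^2] + n^2\, \mathrm{var}(Q) \\
&= n(\mu - \mu^2 - \sigma^2) + n^2\sigma^2 \\
&= n(\mu - \mu^2) + n(n - 1)\sigma^2.
\end{aligned}$$

To verify the result using the covariance formulas of part (b), we write

$$\begin{aligned}
\mathrm{var}(X) &= \mathrm{var}(X_1 + \cdots + X_n) \\
&= \sum_{i=1}^{n} \mathrm{var}(X_i) + \sum_{\{(i,j)\,:\, i \ne j\}} \mathrm{cov}(X_i, X_j) \\
&= n\, \mathrm{var}(X_1) + n(n - 1)\, \mathrm{cov}(X_1, X_2) \\
&= n(\mu - \mu^2) + n(n - 1)\sigma^2.
\end{aligned}$$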
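
As a further cross-check that uses neither derivation, one can build the PMF of X directly by averaging binomial PMFs over the distribution of Q, and compare its variance with the formula. The Python sketch below does this in exact arithmetic. The two-point distribution of Q and the value n = 5 are hypothetical choices.

```python
from fractions import Fraction
from math import comb

def heads_pmf(n, q_pmf):
    """PMF of X: average the binomial(n, q) PMF over the PMF of Q."""
    return {
        k: sum(w * comb(n, k) * q ** k * (1 - q) ** (n - k)
               for q, w in q_pmf.items())
        for k in range(n + 1)
    }

def mean(pmf):
    return sum(v * p for v, p in pmf.items())

def variance(pmf):
    m = mean(pmf)
    return sum((v - m) ** 2 * p for v, p in pmf.items())

# Hypothetical example: Q is 1/4 or 3/4 with equal probability, n = 5.
q_pmf = {Fraction(1, 4): Fraction(1, 2), Fraction(3, 4): Fraction(1, 2)}
n = 5
mu, sigma2 = mean(q_pmf), variance(q_pmf)

direct = variance(heads_pmf(n, q_pmf))
formula = n * (mu - mu ** 2) + n * (n - 1) * sigma2
print(direct == formula)
```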


Problem 28.* The Bivariate Normal PDF. The (zero mean) bivariate normal
PDF is of the form

$$f_{X,Y}(x, y) = c\, e^{-q(x,y)},$$

where the exponent term q(x, y) is a quadratic function of x and y,

$$q(x, y) = \frac{\dfrac{x^2}{\sigma_x^2} - 2\rho\,\dfrac{xy}{\sigma_x \sigma_y} + \dfrac{y^2}{\sigma_y^2}}{2(1 - \rho^2)},$$

$\sigma_x$ and $\sigma_y$ are positive constants, $\rho$ is a constant that satisfies $-1 < \rho < 1$, and c is a
normalizing constant.


(a) By completing the square, rewrite q(x, y) in the form $(\alpha x - \beta y)^2 + \gamma y^2$, for some
constants $\alpha$, $\beta$, and $\gamma$.


(b) Show that X and Y are zero mean normal random variables with variance $\sigma_x^2$
and $\sigma_y^2$, respectively.


(c) Find the normalizing constant c.


(d) Show that the conditional PDF of X given that Y = y is normal, and identify
its conditional mean and variance.


(e) Show that the correlation coefficient of X and Y is equal to $\rho$.
(f) Show that X and Y are independent if and only if they are uncorrelated.


(g) Show that the estimation error $E[X \mid Y] - X$ is normal with mean zero and
variance $(1 - \rho^2)\sigma_x^2$, and is independent from Y.
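
Solution. (a) We can rewrite q(x, y) in the form

$$q(x, y) = q_1(x, y) + q_2(y),$$

where

$$q_1(x, y) = \frac{1}{2(1 - \rho^2)}\left(\frac{x}{\sigma_x} - \rho\,\frac{y}{\sigma_y}\right)^2, \qquad \text{and} \qquad q_2(y) = \frac{y^2}{2\sigma_y^2}.$$


(b) We have

$$f_Y(y) = \int_{-\infty}^{\infty} c\, e^{-q_1(x,y)}\, e^{-q_2(y)}\, dx = c\, e^{-q_2(y)} \int_{-\infty}^{\infty} e^{-q_1(x,y)}\, dx.$$

Using the change of variables

$$u = \frac{x/\sigma_x - \rho y/\sigma_y}{\sqrt{1 - \rho^2}},$$

we obtain

$$\int_{-\infty}^{\infty} e^{-q_1(x,y)}\, dx = \sigma_x \sqrt{1 - \rho^2} \int_{-\infty}^{\infty} e^{-u^2/2}\, du = \sigma_x \sqrt{1 - \rho^2}\, \sqrt{2\pi}.$$

Thus,

$$f_Y(y) = c\, \sigma_x \sqrt{1 - \rho^2}\, \sqrt{2\pi}\; e^{-y^2/2\sigma_y^2}.$$

We recognize this as a normal PDF with mean zero and variance $\sigma_y^2$. The result for
the random variable X follows by symmetry.


(c) The normalizing constant for the PDF of Y must be equal to $1/(\sqrt{2\pi}\, \sigma_y)$. It follows
that

$$c\, \sigma_x \sqrt{1 - \rho^2}\, \sqrt{2\pi} = \frac{1}{\sqrt{2\pi}\, \sigma_y},$$

which implies that

$$c = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}.$$


(d) Since

$$f_{X,Y}(x, y) = \frac{1}{2\pi \sigma_x \sigma_y \sqrt{1 - \rho^2}}\; e^{-q_1(x,y)}\, e^{-q_2(y)}$$

and

$$f_Y(y) = \frac{1}{\sqrt{2\pi}\, \sigma_y}\; e^{-q_2(y)},$$

we obtain

$$f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{1}{\sqrt{2\pi}\, \sigma_x \sqrt{1 - \rho^2}} \exp\left\{ -\frac{(x - \rho \sigma_x y/\sigma_y)^2}{2\sigma_x^2 (1 - \rho^2)} \right\}.$$

For any fixed y, we recognize this as a normal PDF with mean $(\rho \sigma_x/\sigma_y) y$, and variance
$\sigma_x^2 (1 - \rho^2)$. In particular, $E[X \mid Y = y] = (\rho \sigma_x/\sigma_y) y$, and $E[X \mid Y] = (\rho \sigma_x/\sigma_y) Y$.


(e) Using the expected value rule and the law of iterated expectations, we have

$$\begin{aligned}
E[XY] &= E\big[E[XY \mid Y]\big] \\
&= E\big[Y E[X \mid Y]\big] \\
&= E\big[Y (\rho \sigma_x/\sigma_y) Y\big] \\
&= \frac{\rho \sigma_x}{\sigma_y}\, E[Y^2] \\
&= \rho \sigma_x \sigma_y.
\end{aligned}$$

Thus, the correlation coefficient $\rho(X, Y)$ is equal to

$$\rho(X, Y) = \frac{\mathrm{cov}(X, Y)}{\sigma_x \sigma_y} = \frac{E[XY]}{\sigma_x \sigma_y} = \rho.$$


(f) If X and Y are uncorrelated, then $\rho = 0$, and the joint PDF satisfies $f_{X,Y}(x, y) =
f_X(x) f_Y(y)$, so that X and Y are independent. Conversely, if X and Y are independent,
then they are automatically uncorrelated.


(g) From part (d), we know that conditioned on Y = y, X is normal with mean
$E[X \mid Y = y]$ and variance $(1 - \rho^2)\sigma_x^2$. Therefore, conditioned on Y = y, the estimation
error $\tilde X = E[X \mid Y = y] - X$ is normal with mean zero and variance $(1 - \rho^2)\sigma_x^2$, i.e.,

$$f_{\tilde X | Y}(\tilde x \mid y) = \frac{1}{\sqrt{2\pi (1 - \rho^2) \sigma_x^2}}\; e^{-\tilde x^2 / 2(1 - \rho^2)\sigma_x^2}.$$

Since the conditional PDF of $\tilde X$ does not depend on the value y of Y, it follows that
$\tilde X$ is independent of Y, and the above conditional PDF is also the unconditional PDF
of $\tilde X$.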


SECTION 4.4. Transforms 


Problem 29. Let X be a random variable that takes the values 1, 2, and 3, with the
following probabilities:

$$P(X = 1) = \frac{1}{2}, \qquad P(X = 2) = \frac{1}{4}, \qquad P(X = 3) = \frac{1}{4}.$$

Find the transform associated with X and use it to obtain the first three moments,
$E[X]$, $E[X^2]$, $E[X^3]$.


Problem 30. Calculate $E[X^3]$ and $E[X^4]$ for a standard normal random variable X.


Problem 31. Find the third, fourth, and fifth moments of an exponential random
variable with parameter $\lambda$.

Problem 32. A nonnegative integer-valued random variable X has one of the following
two expressions as its transform:

1. $M(s) = e^{2(e^{e^s - 1} - 1)}$.

2. $M(s) = e^{2(e^{e^s} - 1)}$.

(a) Explain why one of the two cannot possibly be the transform.


(b) Use the true transform to find P(X = 0).


Problem 33. Find the PDF of the continuous random variable X associated with
the transform

$$M(s) = \frac{1}{3} \cdot \frac{2}{2 - s} + \frac{2}{3} \cdot \frac{3}{3 - s}.$$


Problem 34. A soccer team has three designated players who take turns striking
penalty shots. The ith player has probability of success $p_i$, independent of the successes
of the other players. Let X be the number of successful penalty shots after each player
has had one turn. Use convolution to calculate the PMF of X. Confirm your answer
by first calculating the transform associated with X and then obtaining the PMF from
the transform.


Problem 35. Let X be a random variable that takes nonnegative integer values, and
is associated with a transform of the form

$$M_X(s) = c \cdot \frac{3 + 4e^{2s} + 2e^{3s}}{3 - e^s},$$

where c is some scalar. Find $E[X]$, $p_X(1)$, and $E[X \mid X \ne 0]$.


Problem 36. Let X, Y, and Z be independent random variables, where X is Bernoulli
with parameter 1/3, Y is exponential with parameter 2, and Z is Poisson with
parameter 3.


(a) Consider the new random variable U = XY + (1 - X)Z. Find the transform
associated with U.


(b) Find the transform associated with 2Z + 3.
(c) Find the transform associated with Y + Z.


Problem 37. A pizza parlor serves n different types of pizza, and is visited by a
number K of customers in a given period of time, where K is a nonnegative integer
random variable with a known associated transform $M_K(s) = E[e^{sK}]$. Each customer
orders a single pizza, with all types of pizza being equally likely, independent of the
number of other customers and the types of pizza they order. Give a formula, in terms
of $M_K(\cdot)$, for the expected number of different types of pizzas ordered.


Problem 38.* Let X be a discrete random variable taking nonnegative integer values.
Let M(s) be the transform associated with X.


(a) Show that

$$P(X = 0) = \lim_{s \to -\infty} M(s).$$

(b) Use part (a) to verify that if X is a binomial random variable with parameters
n and p, we have $P(X = 0) = (1 - p)^n$. Furthermore, if X is a Poisson random
variable with parameter $\lambda$, we have $P(X = 0) = e^{-\lambda}$.


(c) Suppose that X is instead known to take only integer values that are greater
than or equal to a given integer k. How can we calculate P(X = k) using the
transform associated with X?


Solution. (a) We have

$$M(s) = \sum_{k=0}^{\infty} P(X = k)\, e^{sk}.$$

As $s \to -\infty$, all the terms $e^{sk}$ with k > 0 tend to 0, so we obtain $\lim_{s \to -\infty} M(s) =
P(X = 0)$.


(b) In the case of the binomial, we have from the transform tables

$$M(s) = (1 - p + pe^s)^n,$$

so that $\lim_{s \to -\infty} M(s) = (1 - p)^n$. In the case of the Poisson, we have

$$M(s) = e^{\lambda(e^s - 1)},$$

so that $\lim_{s \to -\infty} M(s) = e^{-\lambda}$.


(c) The random variable Y = X - k takes only nonnegative integer values and the
associated transform is $M_Y(s) = e^{-sk} M(s)$ (cf. Example 4.25). Since P(Y = 0) =
P(X = k), we have from part (a),

$$P(X = k) = \lim_{s \to -\infty} e^{-sk} M(s).$$


Problem 39.* Transforms associated with uniform random variables. 


(a) Find the transform associated with an integer-valued random variable X that is 
uniformly distributed in the range {a,a+1,...,b}. 


(b) Find the transform associated with a continuous random variable X that is uni- 
formly distributed in the range [a,b]. 


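Solution. (a) The PMF of X is

$$p_X(k) = \begin{cases} \dfrac{1}{b - a + 1}, & \text{if } k = a, a + 1, \ldots, b, \\ 0, & \text{otherwise.} \end{cases}$$

The transform is

$$\begin{aligned}
M(s) &= \sum_{k=a}^{b} \frac{1}{b - a + 1}\, e^{sk} \\
&= \frac{e^{sa}}{b - a + 1} \sum_{k=0}^{b-a} e^{sk} \\
&= \frac{e^{sa}}{b - a + 1} \cdot \frac{e^{s(b-a+1)} - 1}{e^s - 1}.
\end{aligned}$$


(b) We have

$$M(s) = E[e^{sX}] = \int_a^b \frac{e^{sx}}{b - a}\, dx = \frac{e^{sb} - e^{sa}}{s(b - a)}.$$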

Problem 40.* Suppose that the transform associated with a discrete random variable
X has the form

$$M(s) = \frac{A(e^s)}{B(e^s)},$$

where A(t) and B(t) are polynomials of the generic variable t. Assume that A(t) and
B(t) have no common roots and that the degree of A(t) is smaller than the degree of
B(t). Assume also that B(t) has distinct, real, and nonzero roots that have absolute
value greater than 1. Then it can be seen that M(s) can be written in the form

$$M(s) = \frac{a_1}{1 - r_1 e^s} + \cdots + \frac{a_m}{1 - r_m e^s},$$

where $1/r_1, \ldots, 1/r_m$ are the roots of B(t) and the $a_i$ are constants that are equal to
$\lim_{e^s \to 1/r_i} (1 - r_i e^s) M(s)$, $i = 1, \ldots, m$.


(a) Show that the PMF of X has the form

$$P(X = k) = \begin{cases} \displaystyle\sum_{i=1}^{m} a_i r_i^k, & \text{if } k = 0, 1, \ldots, \\ 0, & \text{otherwise.} \end{cases}$$

Note: For large k, the PMF of X can be approximated by $a_{\bar\imath}\, r_{\bar\imath}^k$, where $\bar\imath$ is the
index corresponding to the largest $|r_i|$ (assuming $\bar\imath$ is unique).


(b) Extend the result of part (a) to the case where $M(s) = e^{bs} A(e^s)/B(e^s)$ and b is
an integer.


Solution. (a) We have for all s such that $|r_i| e^s < 1$

$$\frac{1}{1 - r_i e^s} = 1 + r_i e^s + r_i^2 e^{2s} + \cdots.$$

Therefore,

$$M(s) = \sum_{i=1}^{m} a_i + \left(\sum_{i=1}^{m} a_i r_i\right) e^s + \left(\sum_{i=1}^{m} a_i r_i^2\right) e^{2s} + \cdots,$$

and by inverting this transform, we see that

$$P(X = k) = \sum_{i=1}^{m} a_i r_i^k$$

for $k \ge 0$, and P(X = k) = 0 for k < 0. Note that if the coefficients $a_i$ are nonnegative,
this PMF is a mixture of geometric PMFs.


(b) In this case, M(s) corresponds to the translation by b of a random variable whose
transform is $A(e^s)/B(e^s)$ (cf. Example 4.25), so we have

$$P(X = k) = \begin{cases} \displaystyle\sum_{i=1}^{m} a_i r_i^{k-b}, & \text{if } k = b, b + 1, \ldots, \\ 0, & \text{otherwise.} \end{cases}$$

SECTION 4.5. Sum of a Random Number of Independent Random
Variables 


Problem 41. At a certain time, the number of people that enter an elevator is a
Poisson random variable with parameter $\lambda$. The weight of each person is independent
of every other person's weight, and is uniformly distributed between 100 and 200 lbs.
Let $X_i$ be the fraction of 100 by which the ith person exceeds 100 lbs, e.g., if the 7th
person weighs 175 lbs., then $X_7 = 0.75$. Let Y be the sum of the $X_i$.


(a) Find the transform associated with Y. 
(b) Use the transform to compute the expected value of Y. 
(c) Verify your answer to part (b) by using the law of iterated expectations. 


Problem 42. Construct an example to show that the sum of a random number of 
independent normal random variables is not normal (even though a fixed sum is). 


Problem 43. A motorist goes through 4 lights, each of which is found to be red with 
probability 1/2. The waiting times at each light are modeled as independent normal 
random variables with mean 1 minute and standard deviation 1/2 minute. Let X be 
the total waiting time at the red lights. 


(a) Use the total probability theorem to find the PDF and the transform associated 
with X, and the probability that X exceeds 4 minutes. Is X normal? 


(b) Find the transform associated with X by viewing X as a sum of a random number 
of random variables. 


Problem 44. Consider the calculation of the mean and variance of a sum

$$Y = X_1 + \cdots + X_N,$$

where N is itself a sum of integer-valued random variables, i.e.,

$$N = K_1 + \cdots + K_M.$$

Here N, M, $K_1, K_2, \ldots$, $X_1, X_2, \ldots$ are independent random variables, N, M, $K_1, K_2, \ldots$
are integer-valued and nonnegative, $K_1, K_2, \ldots$ are identically distributed with common
mean and variance denoted E[K] and var(K), and $X_1, X_2, \ldots$ are identically distributed
with common mean and variance denoted E[X] and var(X).


(a) Derive formulas for E[N] and var(N) in terms of E[M], var(M), E[K], var(K).


(b) Derive formulas for E[Y] and var(Y) in terms of E[M], var(M), E[K], var(K),
E[X], var(X).


(c) A crate contains M cartons, where M is geometrically distributed with parameter
p. The ith carton contains $K_i$ widgets, where $K_i$ is Poisson-distributed with
parameter $\mu$. The weight of each widget is exponentially distributed with
parameter $\lambda$. All these random variables are independent. Find the expected value
and variance of the total weight of a crate.


Problem 45.* Use transforms to show that the sum of a Poisson-distributed number
of independent, identically distributed Bernoulli random variables is Poisson.


Solution. Let N be a Poisson-distributed random variable with parameter $\lambda$. Let $X_i$,
$i = 1, \ldots, N$, be independent Bernoulli random variables with parameter p, and let

$$L = X_1 + \cdots + X_N$$

be the corresponding sum. The transform associated with L is found by starting with
the transform associated with N, which is

$$M_N(s) = e^{\lambda(e^s - 1)},$$

and replacing each occurrence of $e^s$ by the transform associated with $X_i$, which is

$$M_X(s) = 1 - p + pe^s.$$

We obtain

$$M_L(s) = e^{\lambda(1 - p + pe^s - 1)} = e^{\lambda p(e^s - 1)}.$$

This is the transform associated with a Poisson random variable with parameter $\lambda p$.
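
The conclusion can also be cross-checked without transforms, by conditioning on N and summing the PMF directly. The Python sketch below does this numerically. The function names, the truncation point `n_max`, and the values of $\lambda$ and p are assumptions chosen for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(N = k) for a Poisson random variable with parameter lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def binomial_pmf(k, n, p):
    """P(k successes in n independent Bernoulli(p) trials)."""
    return math.comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def random_sum_pmf(k, lam, p, n_max=80):
    """P(L = k), conditioning on N = n and truncating the sum at n_max.

    Given N = n, the sum of n Bernoulli(p) variables is binomial(n, p).
    The truncation error is negligible when n_max is far above lam.
    """
    return sum(poisson_pmf(n, lam) * binomial_pmf(k, n, p)
               for n in range(k, n_max + 1))

# Illustrative values (assumptions): lam = 3, p = 0.4.
lam, p = 3.0, 0.4
for k in range(6):
    print(k, random_sum_pmf(k, lam, p), poisson_pmf(k, lam * p))
```

The two printed columns should agree up to floating-point rounding.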

In this chapter, we discuss some fundamental issues related to the asymptotic
behavior of sequences of random variables. Our principal context involves a
sequence $X_1, X_2, \ldots$ of independent identically distributed random variables with
mean $\mu$ and variance $\sigma^2$. Let

$$S_n = X_1 + \cdots + X_n$$

be the sum of the first n of them. Limit theorems are mostly concerned with the
properties of $S_n$ and related random variables as n becomes very large.


Because of independence, we have

$$\mathrm{var}(S_n) = \mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n) = n\sigma^2.$$

Thus, the distribution of $S_n$ spreads out as n increases, and cannot have a
meaningful limit. The situation is different if we consider the sample mean

$$M_n = \frac{X_1 + \cdots + X_n}{n} = \frac{S_n}{n}.$$

A quick calculation yields

$$E[M_n] = \mu, \qquad \mathrm{var}(M_n) = \frac{\sigma^2}{n}.$$

In particular, the variance of $M_n$ decreases to zero as n increases, and the bulk
of the distribution of $M_n$ must be very close to the mean $\mu$. This phenomenon
is the subject of certain laws of large numbers, which generally assert that the
sample mean $M_n$ (a random variable) converges to the true mean $\mu$ (a number),
in a precise sense. These laws provide a mathematical basis for the loose
interpretation of an expectation $E[X] = \mu$ as the average of a large number of
independent samples drawn from the distribution of X.

We will also consider a quantity which is intermediate between $S_n$ and $M_n$.
We first subtract $n\mu$ from $S_n$, to obtain the zero-mean random variable $S_n - n\mu$
and then divide by $\sigma\sqrt{n}$, to form the random variable

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.$$

It can be seen that

$$E[Z_n] = 0, \qquad \mathrm{var}(Z_n) = 1.$$

Since the mean and the variance of $Z_n$ remain unchanged as n increases, its
distribution neither spreads, nor shrinks to a point. The central limit theorem
is concerned with the asymptotic shape of the distribution of $Z_n$ and asserts that
it becomes the standard normal distribution.

Limit theorems are useful for several reasons:


(a) Conceptually, they provide an interpretation of expectations (as well as
probabilities) in terms of a long sequence of identical independent experiments.


(b) They allow for an approximate analysis of the properties of random variables
such as $S_n$. This is to be contrasted with an exact analysis which
would require a formula for the PMF or PDF of $S_n$, a complicated and
tedious task when n is large.


(c) They play a major role in inference and statistics, in the presence of large
data sets.


5.1 MARKOV AND CHEBYSHEV INEQUALITIES


In this section, we derive some important inequalities. These inequalities use
the mean and possibly the variance of a random variable to draw conclusions
on the probabilities of certain events. They are primarily useful in situations
where exact values or bounds for the mean and variance of a random variable X
are easily computable, but the distribution of X is either unavailable or hard to
calculate.

We first present the Markov inequality. Loosely speaking, it asserts that
if a nonnegative random variable has a small mean, then the probability that it
takes a large value must also be small.


Markov Inequality 


If a random variable X can only take nonnegative values, then

$$P(X \ge a) \le \frac{E[X]}{a}, \qquad \text{for all } a > 0.$$


To justify the Markov inequality, let us fix a positive number a and consider
the random variable $Y_a$ defined by

$$Y_a = \begin{cases} 0, & \text{if } X < a, \\ a, & \text{if } X \ge a. \end{cases}$$

It is seen that the relation

$$Y_a \le X$$

always holds and therefore,

$$E[Y_a] \le E[X].$$

On the other hand,

$$E[Y_a] = a P(Y_a = a) = a P(X \ge a),$$

from which we obtain

$$a P(X \ge a) \le E[X];$$

see Fig. 5.1 for an illustration.


Figure 5.1: Illustration of the derivation of the Markov inequality. Part (a) of
the figure shows the PDF of a nonnegative random variable X. Part (b) shows
the PMF of a related random variable $Y_a$, which is constructed as follows. All of
the probability mass in the PDF of X that lies between 0 and a is assigned to 0,
and all of the mass that lies above a is assigned to a. Since mass is shifted to the
left, the expectation can only decrease and, therefore,

$$E[X] \ge E[Y_a] = a P(Y_a = a) = a P(X \ge a).$$


Example 5.1. Let X be uniformly distributed in the interval [0, 4] and note that
E[X] = 2. Then, the Markov inequality asserts that

$$P(X \ge 2) \le \frac{2}{2} = 1, \qquad P(X \ge 3) \le \frac{2}{3} = 0.67, \qquad P(X \ge 4) \le \frac{2}{4} = 0.5.$$

By comparing with the exact probabilities

$$P(X \ge 2) = 0.5, \qquad P(X \ge 3) = 0.25, \qquad P(X \ge 4) = 0,$$


we see that the bounds provided by the Markov inequality can be quite loose. 
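
The comparison in Example 5.1 is easy to reproduce. The sketch below is only an illustration; the helper names `markov_bound` and `uniform_tail` are hypothetical, and the bound is capped at 1 because no probability can exceed 1.

```python
def markov_bound(mean, a):
    """Markov upper bound on P(X >= a) for a nonnegative X with the given mean."""
    if a <= 0:
        raise ValueError("a must be positive")
    return min(1.0, mean / a)

def uniform_tail(a, low=0.0, high=4.0):
    """Exact P(X >= a) for X uniform on [low, high]."""
    if a <= low:
        return 1.0
    if a >= high:
        return 0.0
    return (high - a) / (high - low)

# X uniform on [0, 4] has mean 2, as in Example 5.1.
for a in (2, 3, 4):
    print(a, round(markov_bound(2.0, a), 2), uniform_tail(a))
```

Each printed row shows the threshold a, the Markov bound, and the exact tail probability, so the looseness of the bound can be read off directly.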

We continue with the Chebyshev inequality. Loosely speaking, it asserts
that if a random variable has small variance, then the probability that it takes a
value far from its mean is also small. Note that the Chebyshev inequality does
not require the random variable to be nonnegative.


Chebyshev Inequality 


If X is a random variable with mean $\mu$ and variance $\sigma^2$, then


$$P\big(|X - \mu| \ge c\big) \le \frac{\sigma^2}{c^2}, \qquad \text{for all } c > 0.$$





To justify the Chebyshev inequality, we consider the nonnegative random
variable $(X - \mu)^2$ and apply the Markov inequality with $a = c^2$. We obtain

$$P\big((X - \mu)^2 \ge c^2\big) \le \frac{E\big[(X - \mu)^2\big]}{c^2} = \frac{\sigma^2}{c^2}.$$

We complete the derivation by observing that the event $(X - \mu)^2 \ge c^2$ is identical
to the event $|X - \mu| \ge c$, so that

$$P\big(|X - \mu| \ge c\big) = P\big((X - \mu)^2 \ge c^2\big) \le \frac{\sigma^2}{c^2}.$$

For a similar derivation that bypasses the Markov inequality, assume for
simplicity that X is a continuous random variable, introduce the function

$$g(x) = \begin{cases} 0, & \text{if } |x - \mu| < c, \\ c^2, & \text{if } |x - \mu| \ge c, \end{cases}$$

note that $(x - \mu)^2 \ge g(x)$ for all x, and write

$$\sigma^2 = \int_{-\infty}^{\infty} (x - \mu)^2 f_X(x)\,dx \ge \int_{-\infty}^{\infty} g(x) f_X(x)\,dx = c^2 P\big(|X - \mu| \ge c\big),$$

which is the Chebyshev inequality.
An alternative form of the Chebyshev inequality is obtained by letting
$c = k\sigma$, where k is positive, which yields

$$P\big(|X - \mu| \ge k\sigma\big) \le \frac{\sigma^2}{k^2\sigma^2} = \frac{1}{k^2}.$$

Thus, the probability that a random variable takes a value more than k standard
deviations away from its mean is at most $1/k^2$.

The Chebyshev inequality tends to be more powerful than the Markov
inequality (the bounds that it provides are more accurate), because it also uses
information on the variance of X. Still, the mean and the variance of a random
variable are only a rough summary of its properties, and we cannot expect the
bounds to be close approximations of the exact probabilities.


Example 5.2. As in Example 5.1, let X be uniformly distributed in [0, 4]. Let us
use the Chebyshev inequality to bound the probability that $|X - 2| \ge 1$. We have
$\sigma^2 = 16/12 = 4/3$, and

$$P\big(|X - 2| \ge 1\big) \le \frac{4}{3},$$

which is uninformative.

For another example, let X be exponentially distributed with parameter
$\lambda = 1$, so that E[X] = var(X) = 1. For c > 1, using the Chebyshev inequality, we
obtain

$$P(X \ge c) = P(X - 1 \ge c - 1) \le P\big(|X - 1| \ge c - 1\big) \le \frac{1}{(c - 1)^2}.$$

This is again conservative compared to the exact answer $P(X \ge c) = e^{-c}$.
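
To see how conservative the bound is for the exponential case, one may tabulate both quantities. This is a small sketch under the assumptions of Example 5.2 (rate 1, c > 1); the function names are ours, and the bound is capped at 1.

```python
import math

def chebyshev_tail_bound(c):
    """Chebyshev bound 1/(c-1)^2 on P(X >= c) for X exponential with rate 1, c > 1."""
    if c <= 1:
        raise ValueError("the bound in Example 5.2 is derived for c > 1")
    return min(1.0, 1.0 / (c - 1) ** 2)

def exact_tail(c):
    """Exact P(X >= c) = exp(-c) for X exponential with rate 1."""
    return math.exp(-c)

for c in (2, 3, 5, 10):
    print(c, chebyshev_tail_bound(c), exact_tail(c))
```

The bound decays only polynomially in c, while the exact tail decays exponentially, so the gap widens as c increases.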


Example 5.3. Upper Bounds in the Chebyshev Inequality. When X is
known to take values in a range [a, b], we claim that $\sigma^2 \le (b - a)^2/4$. Thus, if
$\sigma^2$ is unknown, we may use the bound $(b - a)^2/4$ in place of $\sigma^2$ in the Chebyshev
inequality, and obtain

$$P\big(|X - \mu| \ge c\big) \le \frac{(b - a)^2}{4c^2}, \qquad \text{for all } c > 0.$$

To verify our claim, note that for any constant $\gamma$, we have

$$E\big[(X - \gamma)^2\big] = E[X^2] - 2E[X]\gamma + \gamma^2,$$

and the above quadratic is minimized when $\gamma = E[X]$. It follows that

$$\sigma^2 = E\big[(X - E[X])^2\big] \le E\big[(X - \gamma)^2\big], \qquad \text{for all } \gamma.$$

By letting $\gamma = (a + b)/2$, we obtain

$$\sigma^2 \le E\left[\left(X - \frac{a + b}{2}\right)^2\right] = E\big[(X - a)(X - b)\big] + \frac{(b - a)^2}{4} \le \frac{(b - a)^2}{4},$$

where the equality above is verified by straightforward calculation, and the last
inequality follows from the fact

$$(x - a)(x - b) \le 0$$

for all x in the range [a, b].

The bound $\sigma^2 \le (b - a)^2/4$ may be quite conservative, but in the absence of
further information about X, it cannot be improved. It is satisfied with equality
when X is the random variable that takes the two extreme values a and b with
equal probability 1/2.
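
The "straightforward calculation" behind the equality in Example 5.3 is not spelled out in the text. One way to carry it out, offered here as our own filling-in of the omitted step, is to expand both quadratics for a fixed real number x and compare them:

```latex
\begin{align*}
\left(x - \frac{a+b}{2}\right)^2 &= x^2 - (a+b)\,x + \frac{(a+b)^2}{4}, \\
(x-a)(x-b) &= x^2 - (a+b)\,x + ab, \\
\left(x - \frac{a+b}{2}\right)^2 - (x-a)(x-b) &= \frac{(a+b)^2}{4} - ab = \frac{(b-a)^2}{4}.
\end{align*}
```

Replacing x by X and taking expectations of both sides of the last line gives the stated equality, since the right-hand side is a constant.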


5.2 THE WEAK LAW OF LARGE NUMBERS


The weak law of large numbers asserts that the sample mean of a large number 
of independent identically distributed random variables is very close to the true 
mean, with high probability. 

As in the introduction to this chapter, we consider a sequence $X_1, X_2, \ldots$ of
independent identically distributed random variables with mean $\mu$ and variance
$\sigma^2$, and define the sample mean by

$$M_n = \frac{X_1 + \cdots + X_n}{n}.$$

We have

$$E[M_n] = \frac{E[X_1] + \cdots + E[X_n]}{n} = \frac{n\mu}{n} = \mu,$$

and, using independence,

$$\mathrm{var}(M_n) = \frac{\mathrm{var}(X_1 + \cdots + X_n)}{n^2} = \frac{\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)}{n^2} = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}.$$

We apply the Chebyshev inequality and obtain

$$P\big(|M_n - \mu| \ge \epsilon\big) \le \frac{\sigma^2}{n\epsilon^2}, \qquad \text{for any } \epsilon > 0.$$

We observe that for any fixed $\epsilon > 0$, the right-hand side of this inequality goes to
zero as n increases. As a consequence, we obtain the weak law of large numbers,
which is stated below. It turns out that this law remains true even if the $X_i$
have infinite variance, but a much more elaborate argument is needed, which we
omit. The only assumption needed is that $E[X_i]$ is well-defined.


The Weak Law of Large Numbers 


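Let $X_1, X_2, \ldots$ be independent identically distributed random variables with
mean $\mu$. For every $\epsilon > 0$, we have


$$P\big(|M_n - \mu| \ge \epsilon\big) = P\left(\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| \ge \epsilon\right) \to 0, \qquad \text{as } n \to \infty.$$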





The weak law of large numbers states that for large n, the bulk of the
distribution of $M_n$ is concentrated near $\mu$. That is, if we consider a positive
length interval $[\mu - \epsilon, \mu + \epsilon]$ around $\mu$, then there is high probability that $M_n$
will fall in that interval; as $n \to \infty$, this probability converges to 1. Of course, if
$\epsilon$ is very small, we may have to wait longer (i.e., need a larger value of n) before
we can assert that $M_n$ is highly likely to fall in that interval.


Example 5.4. Probabilities and Frequencies. Consider an event A defined in
the context of some probabilistic experiment. Let p = P(A) be the probability of
this event. We consider n independent repetitions of the experiment, and let $M_n$
be the fraction of time that event A occurs; in this context, $M_n$ is often called the
empirical frequency of A. Note that

$$M_n = \frac{X_1 + \cdots + X_n}{n},$$

where $X_i$ is 1 whenever A occurs, and 0 otherwise; in particular, $E[X_i] = p$. The
weak law applies and shows that when n is large, the empirical frequency is most
likely to be within $\epsilon$ of p. Loosely speaking, this allows us to conclude that
empirical frequencies are faithful estimates of p. Alternatively, this is a step towards
interpreting the probability p as the frequency of occurrence of A.
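
The frequency interpretation can be explored with a short simulation. The sketch below is an illustration only: it assumes an event with p = 0.3 and accuracy 0.05 (both arbitrary choices of ours) and estimates $P(|M_n - p| \ge \epsilon)$ by repeating the n-trial experiment many times. The function names are hypothetical.

```python
import random

def empirical_frequency(p, n, rng):
    """Fraction of n independent trials in which an event of probability p occurs."""
    return sum(rng.random() < p for _ in range(n)) / n

def deviation_rate(p, n, eps, trials=2000, seed=0):
    """Monte Carlo estimate of P(|M_n - p| >= eps), using `trials` repetitions."""
    rng = random.Random(seed)
    hits = sum(abs(empirical_frequency(p, n, rng) - p) >= eps for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (10, 100, 1000):
        # The estimated deviation probability should shrink as n grows.
        print(n, deviation_rate(0.3, n, eps=0.05))
```

A simulation of this kind illustrates the weak law but does not prove it; the estimates themselves carry sampling error.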


Example 5.5. Polling. Let p be the fraction of voters who support a particular
candidate for office. We interview n "randomly selected" voters and record $M_n$,
the fraction of them that support the candidate. We view $M_n$ as our estimate of p
and would like to investigate its properties.

We interpret "randomly selected" to mean that the n voters are chosen
independently and uniformly from the given population. Thus, the reply of each person
interviewed can be viewed as an independent Bernoulli random variable $X_i$ with
success probability p and variance $\sigma^2 = p(1 - p)$. The Chebyshev inequality yields

$$P\big(|M_n - p| \ge \epsilon\big) \le \frac{p(1 - p)}{n\epsilon^2}.$$

The true value of the parameter p is assumed to be unknown. On the other hand,
it may be verified that $p(1 - p) \le 1/4$ (cf. Example 5.3), which yields

$$P\big(|M_n - p| \ge \epsilon\big) \le \frac{1}{4n\epsilon^2}.$$

For example, if $\epsilon = 0.1$ and n = 100, we obtain

$$P\big(|M_{100} - p| \ge 0.1\big) \le \frac{1}{4 \cdot 100 \cdot (0.1)^2} = 0.25.$$

In words, with a sample size of n = 100, the probability that our estimate is
incorrect by more than 0.1 is no larger than 0.25.

Suppose now that we impose some tight specifications on our poll. We would
like to have high confidence (probability at least 95%) that our estimate will be
very accurate (within 0.01 of p). How many voters should be sampled?

The only guarantee that we have at this point is the inequality

$$P\big(|M_n - p| \ge 0.01\big) \le \frac{1}{4n(0.01)^2}.$$

We will be sure to satisfy the above specifications if we choose n large enough so
that

$$\frac{1}{4n(0.01)^2} \le 1 - 0.95 = 0.05,$$

which yields $n \ge 50{,}000$. This choice of n satisfies our specifications, but turns
out to be fairly conservative, because it is based on the rather loose Chebyshev
inequality. A refinement will be considered in Section 5.4.

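
The two calculations of Example 5.5 can be packaged as small functions. This is a sketch, not a prescribed method: the names are ours, and we use exact rational arithmetic (inputs passed as strings such as "0.01") only to avoid floating-point rounding when taking the ceiling.

```python
import math
from fractions import Fraction

def chebyshev_poll_bound(n, eps):
    """Worst-case Chebyshev bound 1/(4*n*eps^2) on P(|M_n - p| >= eps), capped at 1."""
    eps = Fraction(eps)  # pass eps as a string such as "0.1" for exact arithmetic
    return min(Fraction(1), 1 / (4 * n * eps ** 2))

def chebyshev_sample_size(eps, delta):
    """Smallest integer n for which 1/(4*n*eps^2) <= delta."""
    eps, delta = Fraction(eps), Fraction(delta)
    return math.ceil(1 / (4 * delta * eps ** 2))

print(float(chebyshev_poll_bound(100, "0.1")))    # 0.25, as in the example
print(chebyshev_sample_size("0.01", "0.05"))      # 50000, as in the example
```

Because the bound uses only the worst-case variance 1/4, the resulting sample size is valid for every p, at the price of being conservative.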

5.3 CONVERGENCE IN PROBABILITY


We can interpret the weak law of large numbers as stating that "$M_n$ converges to
$\mu$." However, since $M_1, M_2, \ldots$ is a sequence of random variables, not a sequence
of numbers, the meaning of convergence has to be made precise. A particular
definition is provided below. To facilitate the comparison with the ordinary
notion of convergence, we also include the definition of the latter.


Convergence of a Deterministic Sequence 


Let $a_1, a_2, \ldots$ be a sequence of real numbers, and let a be another real
number. We say that the sequence $a_n$ converges to a, or $\lim_{n \to \infty} a_n = a$, if
for every $\epsilon > 0$ there exists some $n_0$ such that


$$|a_n - a| \le \epsilon, \qquad \text{for all } n \ge n_0.$$





Intuitively, if $\lim_{n \to \infty} a_n = a$, then for any given accuracy level $\epsilon$, $a_n$ must
be within $\epsilon$ of a, when n is large enough.


Convergence in Probability 


Let $Y_1, Y_2, \ldots$ be a sequence of random variables (not necessarily
independent), and let a be a real number. We say that the sequence $Y_n$ converges
to a in probability, if for every $\epsilon > 0$, we have


$$\lim_{n \to \infty} P\big(|Y_n - a| \ge \epsilon\big) = 0.$$





Given this definition, the weak law of large numbers simply states that the
sample mean converges in probability to the true mean $\mu$. More generally, the
Chebyshev inequality implies that if all $Y_n$ have the same mean $\mu$, and $\mathrm{var}(Y_n)$
converges to 0, then $Y_n$ converges to $\mu$ in probability.

If the random variables $Y_1, Y_2, \ldots$ have a PMF or a PDF and converge
in probability to a, then according to the above definition, "almost all" of the
PMF or PDF of $Y_n$ is concentrated within $\epsilon$ of a for large values of n. It is also
instructive to rephrase the above definition as follows: for every $\epsilon > 0$, and for
every $\delta > 0$, there exists some $n_0$ such that

$$P\big(|Y_n - a| \ge \epsilon\big) \le \delta, \qquad \text{for all } n \ge n_0.$$

If we refer to $\epsilon$ as the accuracy level, and $\delta$ as the confidence level, the definition
takes the following intuitive form: for any given level of accuracy and confidence,
$Y_n$ will be equal to a, within these levels of accuracy and confidence, provided
that n is large enough.


Example 5.6. Consider a sequence of independent random variables $X_n$ that are
uniformly distributed in the interval [0, 1], and let

$$Y_n = \min\{X_1, \ldots, X_n\}.$$

The sequence of values of $Y_n$ cannot increase as n increases, and it will occasionally
decrease (whenever a value of $X_n$ that is smaller than the preceding values is
obtained). Thus, we intuitively expect that $Y_n$ converges to zero. Indeed, for $\epsilon > 0$,
we have, using the independence of the $X_n$,

$$\begin{aligned} P\big(|Y_n - 0| \ge \epsilon\big) &= P(X_1 \ge \epsilon, \ldots, X_n \ge \epsilon) \\ &= P(X_1 \ge \epsilon) \cdots P(X_n \ge \epsilon) \\ &= (1 - \epsilon)^n. \end{aligned}$$

In particular,

$$\lim_{n \to \infty} P\big(|Y_n - 0| \ge \epsilon\big) = \lim_{n \to \infty} (1 - \epsilon)^n = 0.$$

Since this is true for every $\epsilon > 0$, we conclude that $Y_n$ converges to zero, in
probability.
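
The formula $(1 - \epsilon)^n$ from Example 5.6 can be compared against a simulation. The following sketch assumes $0 < \epsilon < 1$; the function names and the particular values of n and $\epsilon$ are our own illustrative choices.

```python
import random

def exact_prob(n, eps):
    """P(Y_n >= eps) = (1 - eps)^n for Y_n = min(X_1..X_n), X_i uniform on [0, 1]."""
    return (1 - eps) ** n

def simulated_prob(n, eps, trials=20000, seed=0):
    """Monte Carlo estimate of P(Y_n >= eps)."""
    rng = random.Random(seed)
    hits = sum(min(rng.random() for _ in range(n)) >= eps for _ in range(trials))
    return hits / trials

if __name__ == "__main__":
    for n in (1, 5, 20, 50):
        print(n, round(exact_prob(n, 0.1), 4), simulated_prob(n, 0.1))
```

Both columns should decrease toward zero as n grows, in line with convergence in probability.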


Example 5.7. Let Y be an exponentially distributed random variable with
parameter $\lambda = 1$. For any positive integer n, let $Y_n = Y/n$. (Note that these random
variables are dependent.) We wish to investigate whether the sequence $Y_n$ converges
to zero.

For $\epsilon > 0$, we have

$$P\big(|Y_n - 0| \ge \epsilon\big) = P(Y_n \ge \epsilon) = P(Y \ge n\epsilon) = e^{-n\epsilon}.$$

In particular,

$$\lim_{n \to \infty} P\big(|Y_n - 0| \ge \epsilon\big) = \lim_{n \to \infty} e^{-n\epsilon} = 0.$$

Since this is the case for every $\epsilon > 0$, $Y_n$ converges to zero, in probability.


One might be tempted to believe that if a sequence $Y_n$ converges to a
number a, then $E[Y_n]$ must also converge to a. The following example shows
that this need not be the case, and illustrates some of the limitations of the
notion of convergence in probability.


Example 5.8. Consider a sequence of discrete random variables $Y_n$ with the
following distribution:

$$P(Y_n = y) = \begin{cases} 1 - \dfrac{1}{n}, & \text{for } y = 0, \\[4pt] \dfrac{1}{n}, & \text{for } y = n^2, \\[4pt] 0, & \text{elsewhere;} \end{cases}$$

see Fig. 5.2 for an illustration. For every $\epsilon > 0$, we have

$$\lim_{n \to \infty} P\big(|Y_n| \ge \epsilon\big) = \lim_{n \to \infty} \frac{1}{n} = 0,$$

and $Y_n$ converges to zero in probability. On the other hand, $E[Y_n] = n^2/n = n$,
which goes to infinity as n increases.


Figure 5.2: The PMF of the random variable $Y_n$ in Example 5.8.
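
The contrast in Example 5.8 between a vanishing tail probability and a diverging mean can be computed exactly. The sketch below represents the PMF as a dictionary and uses rational arithmetic; the helper names `pmf`, `tail_prob`, and `mean` are hypothetical.

```python
from fractions import Fraction

def pmf(n):
    """PMF of Y_n from Example 5.8, as a dict {value: probability}."""
    return {0: 1 - Fraction(1, n), n * n: Fraction(1, n)}

def tail_prob(n, eps):
    """P(|Y_n| >= eps), summed directly from the PMF."""
    return sum(prob for y, prob in pmf(n).items() if abs(y) >= eps)

def mean(n):
    """E[Y_n], summed directly from the PMF."""
    return sum(y * prob for y, prob in pmf(n).items())

for n in (10, 100, 1000):
    print(n, tail_prob(n, Fraction(1, 2)), mean(n))
```

For each n the tail probability equals 1/n while the mean equals n, so the two quantities move in opposite directions.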


5.4 THE CENTRAL LIMIT THEOREM 


According to the weak law of large numbers, the distribution of the sample mean
$M_n = (X_1 + \cdots + X_n)/n$ is increasingly concentrated in the near vicinity of the
true mean $\mu$. In particular, its variance tends to zero. On the other hand, the
variance of the sum

$$S_n = X_1 + \cdots + X_n = nM_n$$

increases to infinity, and the distribution of $S_n$ cannot be said to converge to
anything meaningful. An intermediate view is obtained by considering the
deviation $S_n - n\mu$ of $S_n$ from its mean $n\mu$, and scaling it by a factor proportional to
$1/\sqrt{n}$. What is special about this particular scaling is that it keeps the variance
at a constant level. The central limit theorem asserts that the distribution of
this scaled random variable approaches a normal distribution.

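More precisely, let $X_1, X_2, \ldots$ be a sequence of independent identically
distributed random variables with mean $\mu$ and variance $\sigma^2$. We define

$$Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}} = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}.$$

An easy calculation yields

$$E[Z_n] = \frac{E[X_1 + \cdots + X_n] - n\mu}{\sigma\sqrt{n}} = 0,$$

and

$$\mathrm{var}(Z_n) = \frac{\mathrm{var}(X_1 + \cdots + X_n)}{\sigma^2 n} = \frac{\mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)}{\sigma^2 n} = \frac{n\sigma^2}{n\sigma^2} = 1.$$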






The Central Limit Theorem 


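Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed random
variables with common mean $\mu$ and variance $\sigma^2$, and define

$$Z_n = \frac{X_1 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}.$$

Then, the CDF of $Z_n$ converges to the standard normal CDF

$$\Phi(z) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{z} e^{-x^2/2}\,dx,$$

in the sense that

$$\lim_{n \to \infty} P(Z_n \le z) = \Phi(z), \qquad \text{for every } z.$$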





The central limit theorem is surprisingly general. Besides independence,
and the implicit assumption that the mean and variance are finite, it places
no other requirement on the distribution of the $X_i$, which could be discrete,
continuous, or mixed; see the end-of-chapter problems for an outline of its proof.

This theorem is of tremendous importance for several reasons, both concep- 
tual and practical. On the conceptual side, it indicates that the sum of a large 
number of independent random variables is approximately normal. As such, it 
applies to many situations in which a random effect is the sum of a large number 
of small but independent random factors. Noise in many natural or engineered 
systems has this property. In a wide array of contexts, it has been found empir- 
ically that the statistics of noise are well-described by normal distributions, and 
the central limit theorem provides a convincing explanation for this phenomenon. 

On the practical side, the central limit theorem eliminates the need for
detailed probabilistic models, and for tedious manipulations of PMFs and PDFs.
Rather, it allows the calculation of certain probabilities by simply referring to the
normal CDF table. Furthermore, these calculations only require the knowledge
of means and variances.


Approximations Based on the Central Limit Theorem 


The central limit theorem allows us to calculate probabilities related to $Z_n$ as
if $Z_n$ were normal. Since normality is preserved under linear transformations,
this is equivalent to treating $S_n$ as a normal random variable with mean $n\mu$ and
variance $n\sigma^2$.


Normal Approximation Based on the Central Limit Theorem 


Let $S_n = X_1 + \cdots + X_n$, where the $X_i$ are independent identically
distributed random variables with mean $\mu$ and variance $\sigma^2$. If n is large, the
probability $P(S_n \le c)$ can be approximated by treating $S_n$ as if it were
normal, according to the following procedure.


1. Calculate the mean $n\mu$ and the variance $n\sigma^2$ of $S_n$.


2. Calculate the normalized value $z = (c - n\mu)/(\sigma\sqrt{n})$.


3. Use the approximation

$$P(S_n \le c) \approx \Phi(z),$$

where $\Phi(z)$ is available from standard normal CDF tables.





Example 5.9. We load on a plane 100 packages whose weights are independent 
random variables that are uniformly distributed between 5 and 50 pounds. What is 
the probability that the total weight will exceed 3000 pounds? It is not easy to cal- 
culate the CDF of the total weight and the desired probability, but an approximate 
answer can be quickly obtained using the central limit theorem. 

We want to calculate $P(S_{100} > 3000)$, where $S_{100}$ is the sum of the weights
of 100 packages. The mean and the variance of the weight of a single package are

$$\mu = \frac{5 + 50}{2} = 27.5, \qquad \sigma^2 = \frac{(50 - 5)^2}{12} = 168.75,$$

based on the formulas for the mean and variance of the uniform PDF. We calculate
the normalized value

$$z = \frac{3000 - 100 \cdot 27.5}{\sqrt{168.75 \cdot 100}} = \frac{250}{129.9} = 1.92,$$

and use the standard normal tables to obtain the approximation

$$P(S_{100} \le 3000) \approx \Phi(1.92) = 0.9726.$$

Thus, the desired probability is

$$P(S_{100} > 3000) = 1 - P(S_{100} \le 3000) \approx 1 - 0.9726 = 0.0274.$$
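
The three-step normal approximation procedure can be written as a short function. This sketch uses `statistics.NormalDist` from the Python standard library in place of a printed table; the function name `clt_cdf_approx` is our own. Because z is not rounded to two decimals here, the result may differ from the table-based value in the last digit or two.

```python
import math
from statistics import NormalDist

def clt_cdf_approx(c, n, mu, var):
    """Approximate P(S_n <= c) by Phi(z), with z = (c - n*mu) / (sigma*sqrt(n))."""
    z = (c - n * mu) / math.sqrt(n * var)
    return NormalDist().cdf(z)

# Example 5.9: 100 package weights, each uniform between 5 and 50 pounds.
mu = (5 + 50) / 2
var = (50 - 5) ** 2 / 12
p_exceed = 1 - clt_cdf_approx(3000, 100, mu, var)
print(round(p_exceed, 4))
```

The same function applies to any sum of independent identically distributed terms once the per-term mean and variance are known.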


Example 5.10. A machine processes parts, one at a time. The processing times
of different parts are independent random variables, uniformly distributed in [1, 5].
We wish to approximate the probability that the number of parts processed within
320 time units, denoted by $N_{320}$, is at least 100.

There is no obvious way of expressing the random variable $N_{320}$ as a sum
of independent random variables, but we can proceed differently. Let $X_i$ be the
processing time of the ith part, and let $S_{100} = X_1 + \cdots + X_{100}$ be the total processing
time of the first 100 parts. The event $\{N_{320} \ge 100\}$ is the same as the event
$\{S_{100} \le 320\}$, and we can now use a normal approximation to the distribution of
$S_{100}$. Note that $\mu = E[X_i] = 3$ and $\sigma^2 = \mathrm{var}(X_i) = 16/12 = 4/3$. We calculate the
normalized value

$$z = \frac{320 - n\mu}{\sigma\sqrt{n}} = \frac{320 - 300}{\sqrt{100 \cdot 4/3}} = 1.73,$$

and use the approximation

$$P(S_{100} \le 320) \approx \Phi(1.73) = 0.9582.$$


If the variance of the $X_i$ is unknown, but an upper bound is available,
the normal approximation can be used to obtain bounds on the probabilities of
interest.


Example 5.11. Polling. Let us revisit the polling problem in Example 5.5.
We poll n voters and record the fraction $M_n$ of those polled who are in favor of a
particular candidate. If p is the fraction of the entire voter population that supports
this candidate, then

$$M_n = \frac{X_1 + \cdots + X_n}{n},$$

where the $X_i$ are independent Bernoulli random variables with parameter p. In
particular, $M_n$ has mean p and variance p(1 - p)/n. By the normal approximation,
$X_1 + \cdots + X_n$ is approximately normal, and therefore $M_n$ is also approximately
normal.

We are interested in the probability $P(|M_n - p| \ge \epsilon)$ that the polling error is
larger than some desired accuracy $\epsilon$. Because of the symmetry of the normal PDF
around the mean, we have

$$P\big(|M_n - p| \ge \epsilon\big) \approx 2P(M_n - p \ge \epsilon).$$

The variance p(1 - p)/n of $M_n - p$ depends on p and is therefore unknown. We note
that the probability of a large deviation from the mean increases with the variance.
Thus, we can obtain an upper bound on $P(M_n - p \ge \epsilon)$ by assuming that $M_n - p$
has the largest possible variance, namely, 1/(4n), which corresponds to p = 1/2. To
calculate this upper bound, we evaluate the standardized value

$$z = \frac{\epsilon}{1/(2\sqrt{n})} = 2\epsilon\sqrt{n},$$

and use the normal approximation

$$P(M_n - p \ge \epsilon) \le 1 - \Phi(z) = 1 - \Phi(2\epsilon\sqrt{n}).$$

For instance, consider the case where n = 100 and $\epsilon = 0.1$. Assuming the
worst-case variance, and treating $M_n$ as if it were normal, we obtain

$$P\big(|M_{100} - p| \ge 0.1\big) \approx 2P(M_{100} - p \ge 0.1) \le 2 - 2\Phi(2 \cdot 0.1 \cdot \sqrt{100}) = 2 - 2\Phi(2) = 2 - 2 \cdot 0.977 = 0.046.$$

This is much smaller (and more accurate) than the estimate of 0.25 that was
obtained in Example 5.5 using the Chebyshev inequality.

We now consider a reverse problem. How large a sample size n is needed
if we wish our estimate $M_n$ to be within 0.01 of p with probability at least 0.95?
Assuming again the worst possible variance, we are led to the condition

$$2 - 2\Phi(2 \cdot 0.01 \cdot \sqrt{n}) \le 0.05,$$

or

$$\Phi(2 \cdot 0.01 \cdot \sqrt{n}) \ge 0.975.$$

From the normal tables, we see that $\Phi(1.96) = 0.975$, which leads to

$$2 \cdot 0.01 \cdot \sqrt{n} \ge 1.96,$$

or

$$n \ge \frac{(1.96)^2}{4 \cdot (0.01)^2} = 9604.$$

This is significantly better than the sample size of 50,000 that we found using
Chebyshev's inequality.
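
The worst-case normal calculation of Example 5.11 can be sketched as follows. The function names are ours, and `NormalDist().inv_cdf` from the Python standard library stands in for reading the value 1.96 off a normal table.

```python
import math
from statistics import NormalDist

def normal_poll_bound(n, eps):
    """Worst-case (p = 1/2) normal approximation 2 - 2*Phi(2*eps*sqrt(n))."""
    return 2 - 2 * NormalDist().cdf(2 * eps * math.sqrt(n))

def normal_sample_size(eps, delta):
    """Smallest n with 2 - 2*Phi(2*eps*sqrt(n)) <= delta under the normal approximation."""
    z = NormalDist().inv_cdf(1 - delta / 2)   # about 1.96 when delta = 0.05
    return math.ceil((z / (2 * eps)) ** 2)

print(round(normal_poll_bound(100, 0.1), 3))
print(normal_sample_size(0.01, 0.05))
```

Unlike the Chebyshev-based sample size, this figure rests on an approximation rather than a guaranteed inequality, so it should be read as an estimate.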


The normal approximation is increasingly accurate as n tends to infinity,
but in practice we are generally faced with specific and finite values of n. It
would be useful to know how large n should be before the approximation can
be trusted, but there are no simple and general guidelines. Much depends on
whether the distribution of the $X_i$ is close to normal and, in particular, whether
it is symmetric. For example, if the $X_i$ are uniform, then $S_8$ is already very close
to normal. But if the $X_i$ are, say, exponential, a significantly larger n will be
needed before the distribution of $S_n$ is close to a normal one. Furthermore, the
normal approximation to $P(S_n \le c)$ tends to be more faithful when c is in the
vicinity of the mean of $S_n$.


De Moivre-Laplace Approximation to the Binomial

A binomial random variable $S_n$ with parameters n and p can be viewed as the
sum of n independent Bernoulli random variables $X_1, \ldots, X_n$, with common
parameter p:

$$S_n = X_1 + \cdots + X_n.$$

Recall that

$$\mu = E[X_i] = p, \qquad \sigma = \sqrt{\mathrm{var}(X_i)} = \sqrt{p(1 - p)}.$$

We will now use the approximation suggested by the central limit theorem to
provide an approximation for the probability of the event $\{k \le S_n \le l\}$, where
k and l are given integers. We express the event of interest in terms of a
standardized random variable, using the equivalence

$$k \le S_n \le l \iff \frac{k - np}{\sqrt{np(1-p)}} \le \frac{S_n - np}{\sqrt{np(1-p)}} \le \frac{l - np}{\sqrt{np(1-p)}}.$$

By the central limit theorem, $(S_n - np)/\sqrt{np(1-p)}$ has approximately a
standard normal distribution, and we obtain

$$\begin{aligned} P(k \le S_n \le l) &= P\left(\frac{k - np}{\sqrt{np(1-p)}} \le \frac{S_n - np}{\sqrt{np(1-p)}} \le \frac{l - np}{\sqrt{np(1-p)}}\right) \\ &\approx \Phi\left(\frac{l - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - np}{\sqrt{np(1-p)}}\right). \end{aligned}$$

An approximation of this form is equivalent to treating $S_n$ as a normal
random variable with mean np and variance np(1 - p). Figure 5.3 provides an
illustration and indicates that a more accurate approximation may be possible if
we replace k and l by $k - \frac{1}{2}$ and $l + \frac{1}{2}$, respectively. The corresponding formula
is given below.


Figure 5.3: The central limit approximation treats a binomial random variable
$S_n$ as if it were normal with mean np and variance np(1 - p). This figure shows
a binomial PMF together with the approximating normal PDF. (a) A first
approximation of a binomial probability $P(k \le S_n \le l)$ is obtained by integrating
the area under the normal PDF from k to l, which is the shaded area in the
figure. With this approach, if we have k = l, the probability $P(S_n = k)$ will be
approximated by zero. (b) A possible remedy is to use the normal probability
between $k - \frac{1}{2}$ and $k + \frac{1}{2}$ to approximate $P(S_n = k)$. By extending this idea,
$P(k \le S_n \le l)$ can be approximated by using the area under the normal PDF
from $k - \frac{1}{2}$ to $l + \frac{1}{2}$, which corresponds to the shaded area.


De Moivre-Laplace Approximation to the Binomial


If $S_n$ is a binomial random variable with parameters n and p, n is large, and
k, l are nonnegative integers, then


$$P(k \le S_n \le l) \approx \Phi\left(\frac{l + \frac{1}{2} - np}{\sqrt{np(1-p)}}\right) - \Phi\left(\frac{k - \frac{1}{2} - np}{\sqrt{np(1-p)}}\right).$$





When p is close to 1/2, in which case the PMF of the $X_i$ is symmetric, the
above formula yields a very good approximation for n as low as 40 or 50. When
p is near 1 or near 0, the quality of the approximation drops, and a larger value
of n is needed to maintain the same accuracy.


Example 5.12. Let $S_n$ be a binomial random variable with parameters n = 36
and p = 0.5. An exact calculation yields

$$P(S_n \le 21) = \sum_{k=0}^{21} \binom{36}{k} (0.5)^{36} = 0.8785.$$

The central limit theorem approximation, without the above discussed refinement,
yields

$$P(S_n \le 21) \approx \Phi\left(\frac{21 - np}{\sqrt{np(1-p)}}\right) = \Phi\left(\frac{21 - 18}{3}\right) = \Phi(1) = 0.8413.$$

Using the proposed refinement, we have

$$P(S_n \le 21) \approx \Phi\left(\frac{21.5 - np}{\sqrt{np(1-p)}}\right) = \Phi\left(\frac{21.5 - 18}{3}\right) = \Phi(1.17) = 0.879,$$

which is much closer to the exact value.

The de Moivre-Laplace formula also allows us to approximate the probability
of a single value. For example,

$$P(S_n = 19) \approx \Phi\left(\frac{19.5 - 18}{3}\right) - \Phi\left(\frac{18.5 - 18}{3}\right) = 0.6915 - 0.5675 = 0.124.$$

This is very close to the exact value, which is

$$\binom{36}{19} (0.5)^{36} = 0.1251.$$
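
The comparison in Example 5.12 can be reproduced with exact binomial sums. In this sketch the function names are ours; `math.comb` supplies the binomial coefficients and `statistics.NormalDist` replaces the normal table, so the approximations may differ from the table-based figures in the third or fourth decimal.

```python
import math
from statistics import NormalDist

def binom_exact(n, p, k, l):
    """Exact P(k <= S_n <= l) for a binomial S_n with parameters n and p."""
    return sum(math.comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(k, l + 1))

def de_moivre_laplace(n, p, k, l):
    """Approximation Phi((l + 1/2 - np)/sd) - Phi((k - 1/2 - np)/sd), sd = sqrt(np(1-p))."""
    sd = math.sqrt(n * p * (1 - p))
    phi = NormalDist().cdf
    return phi((l + 0.5 - n * p) / sd) - phi((k - 0.5 - n * p) / sd)

print(round(binom_exact(36, 0.5, 0, 21), 4), round(de_moivre_laplace(36, 0.5, 0, 21), 4))
print(round(binom_exact(36, 0.5, 19, 19), 4), round(de_moivre_laplace(36, 0.5, 19, 19), 4))
```

Trying values of p near 0 or 1 with the same n gives a quick sense of how the quality of the approximation degrades away from the symmetric case.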


5.5 THE STRONG LAW OF LARGE NUMBERS


The strong law of large numbers is similar to the weak law in that it also deals 
with the convergence of the sample mean to the true mean. It is different, 
however, because it refers to another type of convergence. 

The following is a general statement of the strong law of large numbers.
A proof of the strong law, under the mildly restrictive assumption that the $X_i$
have finite fourth moments, is developed in the end-of-chapter problems.


The Strong Law of Large Numbers 


Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed random
variables with mean $\mu$. Then, the sequence of sample means
$M_n = (X_1 + \cdots + X_n)/n$ converges to $\mu$, with probability 1, in the sense that

$$P\left(\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu\right) = 1.$$





In order to interpret the strong law of large numbers, we need to go back to
our original description of probabilistic models in terms of sample spaces. The
contemplated experiment is infinitely long and generates a sequence of values,
one value for each one of the random variables in the sequence $X_1, X_2, \ldots$. Thus,
it is best to think of the sample space as a set of infinite sequences $(x_1, x_2, \ldots)$
of real numbers: any such sequence is a possible outcome of the experiment.
Let us now consider the set A consisting of those sequences $(x_1, x_2, \ldots)$ whose
long-term average is $\mu$, i.e.,

$$(x_1, x_2, \ldots) \in A \iff \lim_{n \to \infty} \frac{x_1 + \cdots + x_n}{n} = \mu.$$

The strong law of large numbers states that all of the probability is concentrated
on this particular subset of the sample space. Equivalently, the collection of
outcomes that do not belong to A (infinite sequences whose long-term average
is not $\mu$) has probability zero.

The difference between the weak and the strong law is subtle and deserves
close scrutiny. The weak law states that the probability $P(|M_n - \mu| \ge \epsilon)$ of a
significant deviation of $M_n$ from $\mu$ goes to zero as $n \to \infty$. Still, for any finite
n, this probability can be positive and it is conceivable that once in a while,
even if infrequently, $M_n$ deviates significantly from $\mu$. The weak law provides
no conclusive information on the number of such deviations, but the strong law
does. According to the strong law, and with probability 1, $M_n$ converges to $\mu$.
This implies that for any given $\epsilon > 0$, the probability that the difference $|M_n - \mu|$
will exceed $\epsilon$ an infinite number of times is equal to zero.


Example 5.13. Probabilities and Frequencies. As in Example 5.4, consider an
event A defined in terms of some probabilistic experiment. We consider a sequence
of independent repetitions of the same experiment, and let $M_n$ be the fraction of
the first n repetitions in which A occurs. The strong law of large numbers asserts
that $M_n$ converges to P(A), with probability 1. In contrast, the weak law of large
numbers asserts that $M_n$ converges to P(A) in probability (cf. Example 5.4).

We have often talked intuitively about the probability of an event A as the
frequency with which it occurs in an infinitely long sequence of independent trials.
The strong law backs this intuition and establishes that the long-term frequency of
occurrence of A is indeed equal to P(A), with essential certainty (the probability
of this happening is 1).
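
The strong law is a statement about individual sample paths, which suggests following one long run of trials rather than many short ones. The sketch below is only suggestive, since a finite simulation cannot establish convergence with probability 1; the function names and the choice p = 0.5 are ours.

```python
import random

def running_frequencies(p, n, seed=0):
    """Return [M_1, ..., M_n]: the running fraction of trials in which A occurred."""
    rng = random.Random(seed)
    count = 0
    path = []
    for i in range(1, n + 1):
        count += rng.random() < p
        path.append(count / i)
    return path

def last_deviation(path, p, eps):
    """Largest 1-based index n with |M_n - p| >= eps along this path, or 0 if none."""
    last = 0
    for i, m in enumerate(path, start=1):
        if abs(m - p) >= eps:
            last = i
    return last

if __name__ == "__main__":
    path = running_frequencies(0.5, 20000, seed=3)
    # After this index, the simulated path never leaves [p - eps, p + eps] again.
    print(last_deviation(path, 0.5, 0.05), path[-1])
```

On a typical run the deviations of size $\epsilon$ stop after some finite index, which is the behavior the strong law asserts for almost every infinite path.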


Convergence with Probability 1 


The convergence concept behind the strong law is different than the notion em- 
ployed in the weak law. We provide here a definition and some discussion of this 
new convergence concept. 


Convergence with Probability 1 


Let $Y_1, Y_2, \ldots$ be a sequence of random variables (not necessarily
independent). Let c be a real number. We say that $Y_n$ converges to c with
probability 1 (or almost surely) if

$$P\left(\lim_{n \to \infty} Y_n = c\right) = 1.$$





Similar to our earlier discussion, a proper interpretation of this type of
convergence involves a sample space consisting of infinite sequences: all of the
probability is concentrated on those sequences that converge to c. This does not
mean that other sequences are impossible, only that they are extremely unlikely,
in the sense that their total probability is zero.


Example 5.14. Let $X_1, X_2, \ldots$ be a sequence of independent random variables
that are uniformly distributed in [0, 1], and let $Y_n = \min\{X_1, \ldots, X_n\}$. We wish to
show that $Y_n$ converges to 0, with probability 1.

In any execution of the experiment, the sequence $Y_n$ is nonincreasing, i.e.,
$Y_{n+1} \le Y_n$ for all n. Since this sequence is bounded below by zero, it must have a
limit, which we denote by Y. Let us fix some $\epsilon > 0$. We have $Y \ge \epsilon$ if and only if
$X_i \ge \epsilon$ for all i, which implies that

$$P(Y \ge \epsilon) \le P(X_1 \ge \epsilon, \ldots, X_n \ge \epsilon) = (1 - \epsilon)^n.$$

Since this is true for all n, we must have

$$P(Y \ge \epsilon) \le \lim_{n \to \infty} (1 - \epsilon)^n = 0.$$

This shows that $P(Y \ge \epsilon) = 0$, for any positive $\epsilon$. We conclude that P(Y > 0) = 0,
which implies that P(Y = 0) = 1. Since Y is the limit of $Y_n$, we see that $Y_n$
converges to zero with probability 1.


Convergence with probability 1 implies convergence in probability (see the 
end-of-chapter problems), but the converse is not necessarily true. Our last 
example illustrates the difference between convergence in probability and con- 
vergence with probability 1. 


Example 5.15. Consider a discrete-time arrival process. The set of times is
partitioned into consecutive intervals of the form $I_k = \{2^k, 2^k + 1, \ldots, 2^{k+1} - 1\}$.
Note that the length of $I_k$ is $2^k$, which increases with k. During each interval $I_k$,
there is exactly one arrival, and all times within an interval are equally likely. The
arrival times within different intervals are assumed to be independent. Let us define
$Y_n = 1$ if there is an arrival at time n, and $Y_n = 0$ if there is no arrival.

We have $P(Y_n \ne 0) = 1/2^k$, if $n \in I_k$. Note that as n increases, it belongs to
intervals $I_k$ with increasingly large indices k. Consequently,

$$\lim_{n \to \infty} P(Y_n \ne 0) = \lim_{k \to \infty} \frac{1}{2^k} = 0,$$

and we conclude that $Y_n$ converges to 0 in probability. However, when we carry out
the experiment, the total number of arrivals is infinite (one arrival during each
interval $I_k$). Therefore, $Y_n$ is unity for infinitely many values of n, the event
$\{\lim_{n \to \infty} Y_n = 0\}$ has zero probability, and we do not have convergence with
probability 1.

Intuitively, the following is happening. At any given time, there is only a
small, and diminishing with n, probability of a substantial deviation from 0, which
implies convergence in probability. On the other hand, given enough time, a
substantial deviation from 0 is certain to occur and for this reason, we do not have
convergence with probability 1.
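
The arrival process of Example 5.15 is simple to simulate. The sketch below assumes that the intervals are indexed by k = 0, 1, 2, ... (so that time 1 forms the first interval), which the text does not state explicitly; the function names are hypothetical.

```python
import random

def arrival_times(num_intervals, seed=0):
    """One arrival, uniformly placed, in each I_k = {2^k, ..., 2^(k+1) - 1}, k = 0, 1, ..."""
    rng = random.Random(seed)
    return [rng.randrange(2 ** k, 2 ** (k + 1)) for k in range(num_intervals)]

def prob_nonzero(n):
    """P(Y_n != 0) = 1/2^k, where k is the index of the interval containing time n >= 1."""
    k = n.bit_length() - 1   # n lies in I_k exactly when 2^k <= n < 2^(k+1)
    return 1 / 2 ** k

times = arrival_times(20, seed=1)
# One arrival per interval: arrivals never stop, yet each fixed time is hit rarely.
print(len(times), times[-1], prob_nonzero(times[-1]))
```

Every simulated path has an arrival in every interval, so $Y_n = 1$ keeps recurring, even though the probability of an arrival at any particular late time is tiny.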


5.6 SUMMARY AND DISCUSSION 


In this chapter. we explored some fundamental aspects of probability theory that 
have major conceptual and practical implications. On the conceptual side, they 
put on a firm ground the interpretation of probability as relative frequency in 
a large number of independent trials. On the practical side, they allow the ap- 
proximate calculation of probabilities in models that involve sums of independent 
random variables and that would be too hard to compute with other means. We 
will see a wealth of applications in the chapter on statistical inference.

We discussed three major laws that take the form of limit theorems.

(a) The first one, the weak law of large numbers, indicates that the sample
mean is very likely to be close to the true mean, as the sample size
increases. It is based on the Chebyshev inequality, which is of independent
interest and is representative of a large collection of useful inequalities that
permeate probability theory.

(b) The second one, the central limit theorem, is one of the most remarkable
results of probability theory, and asserts that the sum of a large number
of independent random variables is approximately normal. The central
limit theorem finds many applications: it is one of the principal tools of
statistical analysis and also justifies the use of normal random variables in
modeling a wide array of situations.

(c) The third one, the strong law of large numbers, makes a more emphatic
connection of probabilities and relative frequencies, and is often an
important tool in theoretical studies.


While developing the various limit theorems, we introduced a number of
convergence concepts (convergence in probability and convergence with
probability 1), which provide a precise language for discussing convergence in
probabilistic models. The limit theorems and the convergence concepts discussed in
this chapter underlie several more advanced topics in the study of probabilistic
models and stochastic processes.

PROBLEMS


SECTION 5.1. Some Useful Inequalities 


Problem 1. A statistician wants to estimate the mean height $h$ (in meters) of a
population, based on $n$ independent samples $X_1, \ldots, X_n$, chosen uniformly from the
entire population. He uses the sample mean $M_n = (X_1 + \cdots + X_n)/n$ as the estimate
of $h$, and a rough guess of 1.0 meters for the standard deviation of the samples $X_i$.


(a) How large should n be so that the standard deviation of Mn is at most 1 cen- 
timeter? 


(b) How large should n be so that Chebyshev's inequality guarantees that the esti- 
mate is within 5 centimeters from h, with probability at least 0.99? 


(c) The statistician realizes that all persons in the population have heights between 
1.4 and 2.0 meters, and revises the standard deviation figure that he uses based 
on the bound of Example 5.3. How should the values of n obtained in parts (a) 
and (b) be revised? 


Problem 2.* The Chernoff bound. The Chernoff bound is a powerful tool that
relies on the transform associated with a random variable, and provides bounds on the
probabilities of certain tail events.

(a) Show that the inequality
$$P(X \geq a) \leq e^{-sa} M(s)$$
holds for every $a$ and every $s \geq 0$, where $M(s) = E[e^{sX}]$ is the transform
associated with the random variable $X$, assumed to be finite in a small open interval
containing $s = 0$.

(b) Show that the inequality
$$P(X \leq a) \leq e^{-sa} M(s)$$
holds for every $a$ and every $s \leq 0$.

(c) Show that the inequality
$$P(X \geq a) \leq e^{-\phi(a)}$$
holds for every $a$, where
$$\phi(a) = \max_{s \geq 0} \big(sa - \ln M(s)\big).$$

(d) Show that if $a > E[X]$, then $\phi(a) > 0$.

(e) Apply the result of part (c) to obtain a bound for $P(X \geq a)$, for the case where
$X$ is a standard normal random variable and $a > 0$.

(f) Let $X_1, X_2, \ldots$ be independent random variables with the same distribution as
$X$. Show that for any $a > E[X]$, we have
$$P\left(\frac{1}{n} \sum_{i=1}^{n} X_i \geq a\right) \leq e^{-n\phi(a)},$$
so that the probability that the sample mean exceeds the mean by a certain
amount decreases exponentially with $n$.


Solution. (a) Given some $a$ and $s \geq 0$, consider the random variable $Y_a$ defined by
$$Y_a = \begin{cases} 0, & \text{if } X < a, \\ e^{sa}, & \text{if } X \geq a. \end{cases}$$
It is seen that the relation
$$Y_a \leq e^{sX}$$
always holds and therefore,
$$E[Y_a] \leq E[e^{sX}] = M(s).$$
On the other hand,
$$E[Y_a] = e^{sa} P(Y_a = e^{sa}) = e^{sa} P(X \geq a),$$
from which we obtain
$$P(X \geq a) \leq e^{-sa} M(s).$$

(b) The argument is similar to the one for part (a). We define $Y_a$ by
$$Y_a = \begin{cases} e^{sa}, & \text{if } X \leq a, \\ 0, & \text{if } X > a. \end{cases}$$
Since $s \leq 0$, the relation
$$Y_a \leq e^{sX}$$
always holds and therefore,
$$E[Y_a] \leq E[e^{sX}] = M(s).$$
On the other hand,
$$E[Y_a] = e^{sa} P(Y_a = e^{sa}) = e^{sa} P(X \leq a),$$
from which we obtain
$$P(X \leq a) \leq e^{-sa} M(s).$$

(c) Since the inequality from part (a) is valid for every $s \geq 0$, we obtain
$$P(X \geq a) \leq \min_{s \geq 0} \big(e^{-sa} M(s)\big) = \min_{s \geq 0} e^{-(sa - \ln M(s))} = e^{-\max_{s \geq 0} (sa - \ln M(s))} = e^{-\phi(a)}.$$

(d) For $s = 0$, we have
$$sa - \ln M(s) = 0 - \ln 1 = 0,$$
where we have used the generic property $M(0) = 1$ of transforms. Furthermore,
$$\frac{d}{ds}\big(sa - \ln M(s)\big)\Big|_{s=0} = a - \frac{1}{M(s)} \cdot \frac{d}{ds} M(s)\Big|_{s=0} = a - 1 \cdot E[X] > 0.$$
Since the function $sa - \ln M(s)$ is zero and has a positive derivative at $s = 0$, it must be
positive when $s$ is positive and small. It follows that the maximum $\phi(a)$ of the function
$sa - \ln M(s)$ over all $s \geq 0$ is also positive.

(e) For a standard normal random variable $X$, we have $M(s) = e^{s^2/2}$. Therefore,
$sa - \ln M(s) = sa - s^2/2$. To maximize this expression over all $s \geq 0$, we form the
derivative, which is $a - s$, and set it to zero, resulting in $s = a$. Thus, $\phi(a) = a^2/2$,
which leads to the bound
$$P(X \geq a) \leq e^{-a^2/2}.$$

Note: In the case where $a \leq E[X]$, the maximizing value of $s$ turns out to be $s = 0$,
resulting in $\phi(a) = 0$ and in the uninteresting bound
$$P(X \geq a) \leq 1.$$


(f) Let $Y = X_1 + \cdots + X_n$. Using the result of part (c), we have
$$P\left(\frac{1}{n} \sum_{i=1}^{n} X_i \geq a\right) = P(Y \geq na) \leq e^{-\phi_Y(na)},$$
where
$$\phi_Y(na) = \max_{s \geq 0} \big(nsa - \ln M_Y(s)\big),$$
and
$$M_Y(s) = \big(M(s)\big)^n$$
is the transform associated with $Y$. We have $\ln M_Y(s) = n \ln M(s)$, from which we
obtain
$$\phi_Y(na) = n \cdot \max_{s \geq 0} \big(sa - \ln M(s)\big) = n\phi(a),$$
and
$$P\left(\frac{1}{n} \sum_{i=1}^{n} X_i \geq a\right) \leq e^{-n\phi(a)}.$$
Note that when $a > E[X]$, part (d) asserts that $\phi(a) > 0$, so the probability of interest
decreases exponentially with $n$.
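
As a numerical sanity check of parts (e) and (f), the following Python sketch compares the Chernoff bound with the exact tail probability of a standard normal, computed with the standard library function `math.erfc`. The function names and the test values of $a$ and $n$ are illustrative choices, not part of the problem. For part (f), the sketch uses the fact that the sample mean of $n$ independent standard normals is normal with variance $1/n$.

```python
import math

def normal_tail(a):
    """Exact P(X >= a) for a standard normal X, via the complementary error function."""
    return 0.5 * math.erfc(a / math.sqrt(2))

def chernoff_bound(a, n=1):
    """Bound exp(-n * a^2 / 2) on P(sample mean of n standard normals >= a), for a > 0."""
    return math.exp(-n * a * a / 2)

def sample_mean_tail(a, n):
    """Exact P((X_1 + ... + X_n)/n >= a): the sample mean is normal with variance 1/n."""
    return normal_tail(a * math.sqrt(n))

for a in (0.5, 1.0, 2.0, 3.0):
    print(a, normal_tail(a), chernoff_bound(a))

for n in (1, 10, 100):
    print(n, sample_mean_tail(0.5, n), chernoff_bound(0.5, n))
```

In each printed row the exact probability should not exceed the bound; the bound is loose in absolute terms but captures the exponential rate of decay in $n$.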


Problem 3.* Jensen inequality. A twice differentiable real-valued function $f$ of a
single variable is called convex if its second derivative $(d^2 f/dx^2)(x)$ is nonnegative for
all $x$ in its domain of definition.

(a) Show that the functions $f(x) = e^{\alpha x}$, $f(x) = -\ln x$, and $f(x) = x^4$ are all convex.

(b) Show that if $f$ is twice differentiable and convex, then the first order Taylor
approximation of $f$ is an underestimate of the function, that is,
$$f(a) + (x - a) \frac{df}{dx}(a) \leq f(x),$$
for every $a$ and $x$.

(c) Show that if $f$ has the property in part (b), and if $X$ is a random variable, then
$$f(E[X]) \leq E[f(X)].$$


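Solution. (a) We have
$$\frac{d^2}{dx^2} e^{\alpha x} = \alpha^2 e^{\alpha x} > 0, \qquad \frac{d^2}{dx^2}(-\ln x) = \frac{1}{x^2} > 0, \qquad \frac{d^2}{dx^2} x^4 = 4 \cdot 3 \cdot x^2 \geq 0.$$

(b) Since the second derivative of $f$ is nonnegative, its first derivative must be
nondecreasing. Using the fundamental theorem of calculus, we obtain
$$f(x) = f(a) + \int_a^x \frac{df}{dt}(t)\, dt \geq f(a) + \int_a^x \frac{df}{dt}(a)\, dt = f(a) + (x - a) \frac{df}{dx}(a).$$

(c) Since the inequality from part (b) is assumed valid for every possible value $x$ of the
random variable $X$, we obtain
$$f(a) + (X - a) \frac{df}{dx}(a) \leq f(X).$$
We now choose $a = E[X]$ and take expectations, to obtain
$$f(E[X]) + \big(E[X] - E[X]\big) \frac{df}{dx}(E[X]) \leq E[f(X)],$$
or
$$f(E[X]) \leq E[f(X)].$$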
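
The Jensen inequality is easy to check numerically for the three convex functions of part (a). The Python sketch below uses a small discrete random variable whose values and PMF are hypothetical, chosen only for illustration (the values are positive so that $-\ln x$ is defined), and takes $\alpha = 2$ in $e^{\alpha x}$.

```python
import math

# Hypothetical discrete random variable X (values and PMF are illustrative).
values = [0.5, 1.0, 2.0, 4.0]
probs = [0.1, 0.4, 0.3, 0.2]

def expectation(g):
    """E[g(X)] for the discrete random variable defined above."""
    return sum(p * g(x) for x, p in zip(values, probs))

mean = expectation(lambda x: x)

convex_functions = {
    "exp(2x)": lambda x: math.exp(2 * x),
    "-ln x": lambda x: -math.log(x),
    "x^4": lambda x: x ** 4,
}

# For each convex f, compare f(E[X]) with E[f(X)].
for name, f in convex_functions.items():
    print(name, f(mean), expectation(f))
```

In each row the first number, $f(E[X])$, should be no larger than the second, $E[f(X)]$.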
SECTION 5.2. The Weak Law of Large Numbers


Problem 4. In order to estimate $f$, the true fraction of smokers in a large population,
Alvin selects $n$ people at random. His estimator $M_n$ is obtained by dividing $S_n$, the
number of smokers in his sample, by $n$, i.e., $M_n = S_n/n$. Alvin chooses the sample
size $n$ to be the smallest possible number for which the Chebyshev inequality yields a
guarantee that
$$P(|M_n - f| \geq \epsilon) \leq \delta,$$
where $\epsilon$ and $\delta$ are some prespecified tolerances. Determine how the value of $n$
recommended by the Chebyshev inequality changes in the following cases.

(a) The value of $\epsilon$ is reduced to half its original value.

(b) The probability $\delta$ is reduced to half its original value.


SECTION 5.3. Convergence in Probability 


Problem 5. Let $X_1, X_2, \ldots$ be independent random variables that are uniformly
distributed over $[-1, 1]$. Show that the sequence $Y_1, Y_2, \ldots$ converges in probability to
some limit, and identify the limit, for each of the following cases:

(a) $Y_n = X_n/n$.

(b) $Y_n = (X_n)^n$.

(c) $Y_n = X_1 \cdot X_2 \cdots X_n$.

(d) $Y_n = \max\{X_1, \ldots, X_n\}$.


Problem 6.* Consider two sequences of random variables $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$,
which converge in probability to some constants. Let $c$ be another constant. Show that
$cX_n$, $X_n + Y_n$, $\max\{0, X_n\}$, $|X_n|$, and $X_n Y_n$ all converge in probability to corresponding
limits.

Solution. Let $x$ and $y$ be the limits of $X_n$ and $Y_n$, respectively. Fix some $\epsilon > 0$ and a
constant $c$. If $c = 0$, then $cX_n$ equals zero for all $n$, and convergence trivially holds. If
$c \neq 0$, we observe that $P(|cX_n - cx| \geq \epsilon) = P(|X_n - x| \geq \epsilon/|c|)$, which converges to
zero, thus establishing convergence in probability of $cX_n$.

We will now show that $P(|X_n + Y_n - x - y| \geq \epsilon)$ converges to zero, for any $\epsilon > 0$.
To bound this probability, we note that for $|X_n + Y_n - x - y|$ to be as large as $\epsilon$, we
need either $|X_n - x|$ or $|Y_n - y|$ (or both) to be at least $\epsilon/2$. Therefore, in terms of
events, we have
$$\{|X_n + Y_n - x - y| \geq \epsilon\} \subset \{|X_n - x| \geq \epsilon/2\} \cup \{|Y_n - y| \geq \epsilon/2\}.$$
This implies that
$$P(|X_n + Y_n - x - y| \geq \epsilon) \leq P(|X_n - x| \geq \epsilon/2) + P(|Y_n - y| \geq \epsilon/2),$$
and
$$\lim_{n \to \infty} P(|X_n + Y_n - x - y| \geq \epsilon) \leq \lim_{n \to \infty} P(|X_n - x| \geq \epsilon/2) + \lim_{n \to \infty} P(|Y_n - y| \geq \epsilon/2) = 0,$$
where the last equality follows since $X_n$ and $Y_n$ converge, in probability, to $x$ and $y$,
respectively.

By a similar argument, it is seen that the event $\{|\max\{0, X_n\} - \max\{0, x\}| \geq \epsilon\}$
is contained in the event $\{|X_n - x| \geq \epsilon\}$. Since $\lim_{n \to \infty} P(|X_n - x| \geq \epsilon) = 0$, this
implies that
$$\lim_{n \to \infty} P\big(|\max\{0, X_n\} - \max\{0, x\}| \geq \epsilon\big) = 0.$$
Hence $\max\{0, X_n\}$ converges to $\max\{0, x\}$ in probability.

We have $|X_n| = \max\{0, X_n\} + \max\{0, -X_n\}$. Since $\max\{0, X_n\}$ and $\max\{0, -X_n\}$
converge, as shown earlier, it follows that their sum, $|X_n|$, converges to $\max\{0, x\} +
\max\{0, -x\} = |x|$ in probability.

Finally, we have
$$P(|X_n Y_n - xy| \geq \epsilon) = P\big(|(X_n - x)(Y_n - y) + xY_n + yX_n - 2xy| \geq \epsilon\big)$$
$$\leq P\big(|(X_n - x)(Y_n - y)| \geq \epsilon/2\big) + P\big(|xY_n + yX_n - 2xy| \geq \epsilon/2\big).$$
Since $xY_n$ and $yX_n$ both converge to $xy$ in probability, the last probability in the above
expression converges to 0. It will thus suffice to show that
$$\lim_{n \to \infty} P\big(|(X_n - x)(Y_n - y)| \geq \epsilon/2\big) = 0.$$
To bound this probability, we note that for $|(X_n - x)(Y_n - y)|$ to be as large as $\epsilon/2$, we
need either $|X_n - x|$ or $|Y_n - y|$ (or both) to be at least $\sqrt{\epsilon/2}$. The rest of the proof
is similar to the earlier proof that $X_n + Y_n$ converges in probability.


Problem 7.* A sequence $X_n$ of random variables is said to converge to a number $c$
in the mean square, if
$$\lim_{n \to \infty} E[(X_n - c)^2] = 0.$$


(a) Show that convergence in the mean square implies convergence in probability. 


(b) Give an example that shows that convergence in probability does not imply con- 
vergence in the mean square. 


Solution. (a) Suppose that $X_n$ converges to $c$ in the mean square. Using the Markov
inequality, we have
$$P(|X_n - c| \geq \epsilon) = P\big(|X_n - c|^2 \geq \epsilon^2\big) \leq \frac{E[(X_n - c)^2]}{\epsilon^2}.$$
Taking the limit as $n \to \infty$, we obtain
$$\lim_{n \to \infty} P(|X_n - c| \geq \epsilon) = 0,$$
which establishes convergence in probability.

(b) In Example 5.8, we have convergence in probability to 0 but $E[Y_n^2] = n^3$, which
diverges to infinity.


SECTION 5.4. The Central Limit Theorem


Problem 8. Before starting to play the roulette in a casino, you want to look for
biases that you can exploit. You therefore watch 100 rounds that result in a number
between 1 and 36, and count the number of rounds for which the result is odd. If the
count exceeds 55, you decide that the roulette is not fair. Assuming that the roulette is
fair, find an approximation for the probability that you will make the wrong decision.


Problem 9. During each day, the probability that your computer's operating system
crashes at least once is 5%, independent of every other day. You are interested in the
probability of at least 45 crash-free days out of the next 50 days.

(a) Find the probability of interest by using the normal approximation to the
binomial.

(b) Repeat part (a), this time using the Poisson approximation to the binomial.


Problem 10. A factory produces $X_n$ gadgets on day $n$, where the $X_n$ are independent
and identically distributed random variables, with mean 5 and variance 9.


(a) Find an approximation to the probability that the total number of gadgets pro- 
duced in 100 days is less than 440. 


(b) Find (approximately) the largest value of $n$ such that
$$P(X_1 + \cdots + X_n \geq 200 + 5n) \leq 0.05.$$


(c) Let N be the first day on which the total number of gadgets produced exceeds 
1000. Calculate an approximation to the probability that N > 220. 


Problem 11. Let $X_1, Y_1, X_2, Y_2, \ldots$ be independent random variables, uniformly
distributed in the unit interval $[0, 1]$, and let
$$W = \frac{(X_1 + \cdots + X_{16}) - (Y_1 + \cdots + Y_{16})}{16}.$$
Find a numerical approximation to the quantity
$$P(|W - E[W]| < 0.001).$$


Problem 12.* Proof of the central limit theorem. Let $X_1, X_2, \ldots$ be a sequence
of independent identically distributed zero-mean random variables with common
variance $\sigma^2$, and associated transform $M_X(s)$. We assume that $M_X(s)$ is finite when
$-d < s < d$, where $d$ is some positive number. Let
$$Z_n = \frac{X_1 + \cdots + X_n}{\sigma\sqrt{n}}.$$

(a) Show that the transform associated with $Z_n$ satisfies
$$M_{Z_n}(s) = \left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n.$$

(b) Suppose that the transform $M_X(s)$ has a second order Taylor series expansion
around $s = 0$, of the form
$$M_X(s) = a + bs + cs^2 + o(s^2),$$
where $o(s^2)$ is a function that satisfies $\lim_{s \to 0} o(s^2)/s^2 = 0$. Find $a$, $b$, and $c$ in
terms of $\sigma^2$.

(c) Combine the results of parts (a) and (b) to show that the transform $M_{Z_n}(s)$
converges to the transform associated with a standard normal random variable,
that is,
$$\lim_{n \to \infty} M_{Z_n}(s) = e^{s^2/2}, \qquad \text{for all } s.$$

Note: The central limit theorem follows from the result of part (c), together with the
fact (whose proof lies beyond the scope of this text) that if the transforms $M_{Z_n}(s)$
converge to the transform $M_Z(s)$ of a random variable $Z$ whose CDF is continuous,
then the CDFs $F_{Z_n}$ converge to the CDF of $Z$. In our case, this implies that the CDF
of $Z_n$ converges to the CDF of a standard normal.


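Solution. (a) We have, using the independence of the $X_i$,
$$M_{Z_n}(s) = E\big[e^{sZ_n}\big] = E\left[\exp\left\{\frac{s}{\sigma\sqrt{n}} \sum_{i=1}^{n} X_i\right\}\right] = \prod_{i=1}^{n} E\big[e^{sX_i/(\sigma\sqrt{n})}\big] = \left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n.$$

(b) Using the moment generating properties of the transform, we have
$$a = M_X(0) = 1, \qquad b = \frac{d}{ds} M_X(s)\Big|_{s=0} = E[X] = 0,$$
and
$$c = \frac{1}{2} \cdot \frac{d^2}{ds^2} M_X(s)\Big|_{s=0} = \frac{E[X^2]}{2} = \frac{\sigma^2}{2}.$$

(c) We combine the results of parts (a) and (b). We have
$$M_{Z_n}(s) = \left(M_X\left(\frac{s}{\sigma\sqrt{n}}\right)\right)^n = \left(a + \frac{bs}{\sigma\sqrt{n}} + \frac{cs^2}{\sigma^2 n} + o\left(\frac{s^2}{\sigma^2 n}\right)\right)^n,$$
and using the formulas for $a$, $b$, and $c$ from part (b), it follows that
$$M_{Z_n}(s) = \left(1 + \frac{s^2}{2n} + o\left(\frac{s^2}{\sigma^2 n}\right)\right)^n.$$
We now take the limit as $n \to \infty$, and use the identity
$$\lim_{n \to \infty} \left(1 + \frac{c}{n}\right)^n = e^c,$$
to obtain
$$\lim_{n \to \infty} M_{Z_n}(s) = e^{s^2/2}.$$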

SECTION 5.5. The Strong Law of Large Numbers 


Problem 13.* Consider two sequences of random variables $X_1, X_2, \ldots$ and $Y_1, Y_2, \ldots$.
Suppose that $X_n$ converges to $a$ and $Y_n$ converges to $b$, with probability 1. Show that
$X_n + Y_n$ converges to $a + b$, with probability 1. Also, assuming that the random variables
$Y_n$ cannot be equal to zero, show that $X_n/Y_n$ converges to $a/b$, with probability 1.


Solution. Let $A$ (respectively, $B$) be the event that the sequence of values of the
random variables $X_n$ (respectively, $Y_n$) does not converge to $a$ (respectively, $b$). Let $C$
be the event that the sequence of values of $X_n + Y_n$ does not converge to $a + b$ and
notice that $C \subset A \cup B$.

Since $X_n$ and $Y_n$ converge to $a$ and $b$, respectively, with probability 1, we have
$P(A) = 0$ and $P(B) = 0$. Hence,
$$P(C) \leq P(A \cup B) \leq P(A) + P(B) = 0.$$
Therefore, $P(C^c) = 1$, or equivalently, $X_n + Y_n$ converges to $a + b$ with probability 1.
For the convergence of $X_n/Y_n$, the argument is similar.


Problem 14.* Let $X_1, X_2, \ldots$ be a sequence of independent identically distributed
random variables. Let $Y_1, Y_2, \ldots$ be another sequence of independent identically
distributed random variables. We assume that the $X_i$ and $Y_i$ have finite mean, and that
$Y_1 + \cdots + Y_n$ cannot be equal to zero. Does the sequence
$$Z_n = \frac{X_1 + \cdots + X_n}{Y_1 + \cdots + Y_n}$$
converge with probability 1, and if so, what is the limit?


Solution. We have
$$Z_n = \frac{(X_1 + \cdots + X_n)/n}{(Y_1 + \cdots + Y_n)/n}.$$
By the strong law of large numbers, the numerator and denominator converge with
probability 1 to $E[X]$ and $E[Y]$, respectively. It follows that $Z_n$ converges to $E[X]/E[Y]$,
with probability 1 (cf. the preceding problem).
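
A simulation can make the conclusion of Problem 14 concrete. The Python sketch below is an illustration under assumed distributions that are not part of the problem: $X_i$ uniform on $[0, 2]$ (so $E[X] = 1$) and $Y_i$ uniform on $[1, 3]$ (so $E[Y] = 2$ and the denominator is never zero). The function name and seeds are also arbitrary.

```python
import random

def ratio_of_sums(n, rng):
    """Z_n = (X_1 + ... + X_n) / (Y_1 + ... + Y_n) for one simulated run."""
    sum_x = sum(rng.uniform(0.0, 2.0) for _ in range(n))  # assumed: E[X] = 1
    sum_y = sum(rng.uniform(1.0, 3.0) for _ in range(n))  # assumed: E[Y] = 2, Y > 0
    return sum_x / sum_y

rng = random.Random(1)  # arbitrary fixed seed
for n in (10, 1_000, 100_000):
    print(n, ratio_of_sums(n, rng))
```

Under these assumptions the printed ratios should settle near $E[X]/E[Y] = 0.5$ as $n$ grows.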


Problem 15.* Suppose that a sequence $Y_1, Y_2, \ldots$ of random variables converges to
a real number $c$, with probability 1. Show that the sequence also converges to $c$ in
probability.


Solution. Let $C$ be the event that the sequence of values of the random variables $Y_n$
converges to $c$. By assumption, we have $P(C) = 1$. Fix some $\epsilon > 0$, and let $A_k$ be
the event that $|Y_n - c| < \epsilon$ for every $n \geq k$. If the sequence of values of the random
variables $Y_n$ converges to $c$, then there must exist some $k$ such that for every $n \geq k$,
this sequence of values is within less than $\epsilon$ from $c$. Therefore, every element of $C$
belongs to $A_k$ for some $k$, or
$$C \subset \bigcup_{k=1}^{\infty} A_k.$$
Note also that the sequence of events $A_k$ is monotonically increasing, in the sense
that $A_k \subset A_{k+1}$ for all $k$. Finally, note that the event $A_k$ is a subset of the event
$\{|Y_k - c| < \epsilon\}$. Therefore,
$$\lim_{k \to \infty} P(|Y_k - c| < \epsilon) \geq \lim_{k \to \infty} P(A_k) = P\Big(\bigcup_{k=1}^{\infty} A_k\Big) \geq P(C) = 1,$$
where the first equality uses the continuity property of probabilities (Problem 13 in
Chapter 1). It follows that
$$\lim_{k \to \infty} P(|Y_k - c| \geq \epsilon) = 0,$$
which establishes convergence in probability.


Problem 16.* Consider a sequence $Y_n$ of nonnegative random variables and suppose
that
$$E\left[\sum_{n=1}^{\infty} Y_n\right] < \infty.$$
Show that $Y_n$ converges to 0, with probability 1.


Note: This result provides a commonly used method for establishing convergence with
probability 1. To evaluate the expectation of $\sum_{n=1}^{\infty} Y_n$, one typically uses the formula
$$E\left[\sum_{n=1}^{\infty} Y_n\right] = \sum_{n=1}^{\infty} E[Y_n].$$
The fact that the expectation and the infinite summation can be interchanged, for
the case of nonnegative random variables, is known as the monotone convergence
theorem, a fundamental result of probability theory, whose proof lies beyond the scope
of this text.


Solution. We note that the infinite sum $\sum_{n=1}^{\infty} Y_n$ must be finite, with probability
1. Indeed, if it had a positive probability of being infinite, then its expectation would
also be infinite. But if the sum of the values of the random variables $Y_n$ is finite, the
sequence of these values must converge to zero. Since the probability of this event is
equal to 1, it follows that the sequence $Y_n$ converges to zero, with probability 1.


Problem 17.* Consider a sequence of Bernoulli random variables $X_n$, and let $p_n =
P(X_n = 1)$ be the probability of success in the $n$th trial. Assuming that $\sum_{n=1}^{\infty} p_n < \infty$,
show that the number of successes is finite, with probability 1. [Compare with Problem
48(b) in Chapter 1.]


Solution. Using the monotone convergence theorem (see above note), we have
$$E\left[\sum_{n=1}^{\infty} X_n\right] = \sum_{n=1}^{\infty} E[X_n] = \sum_{n=1}^{\infty} p_n < \infty.$$
This implies that
$$\sum_{n=1}^{\infty} X_n < \infty,$$
with probability 1. We then note that the event $\{\sum_{n=1}^{\infty} X_n < \infty\}$ is the same as the
event that there is a finite number of successes.


Problem 18.* The strong law of large numbers. Let $X_1, X_2, \ldots$ be a sequence
of independent identically distributed random variables and assume that $E[X_i^4] < \infty$.
Prove the strong law of large numbers.


Solution. We note that the assumption $E[X_i^4] < \infty$ implies that the expected value of
the $X_i$ is finite. Indeed, using the inequality $|x| \leq 1 + x^4$, we have
$$E\big[|X_i|\big] \leq 1 + E[X_i^4] < \infty.$$
Let us assume first that $E[X_i] = 0$. We will show that
$$E\left[\sum_{n=1}^{\infty} \frac{(X_1 + \cdots + X_n)^4}{n^4}\right] < \infty.$$
We have
$$E\left[\frac{(X_1 + \cdots + X_n)^4}{n^4}\right] = \frac{1}{n^4} \sum_{i_1=1}^{n} \sum_{i_2=1}^{n} \sum_{i_3=1}^{n} \sum_{i_4=1}^{n} E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}].$$
Let us consider the various terms in this sum. If one of the indices is different from
all of the other indices, the corresponding term is equal to zero. For example, if $i_1$ is
different from $i_2$, $i_3$, or $i_4$, the assumption $E[X_i] = 0$ yields
$$E[X_{i_1} X_{i_2} X_{i_3} X_{i_4}] = E[X_{i_1}] E[X_{i_2} X_{i_3} X_{i_4}] = 0.$$
Therefore, the nonzero terms in the above sum are either of the form $E[X_i^4]$ (there are
$n$ such terms), or of the form $E[X_i^2 X_j^2]$, with $i \neq j$. Let us count how many terms
there are of this form. Such terms are obtained in three different ways: by setting
$i_1 = i_2 \neq i_3 = i_4$, or by setting $i_1 = i_3 \neq i_2 = i_4$, or by setting $i_1 = i_4 \neq i_2 = i_3$. For
each one of these three ways, we have $n$ choices for the first pair of indices, and $n - 1$
choices for the second pair. We conclude that there are $3n(n - 1)$ terms of this type.
Thus,
$$E\big[(X_1 + \cdots + X_n)^4\big] = nE[X_1^4] + 3n(n - 1)E[X_1^2 X_2^2].$$
Using the inequality $xy \leq (x^2 + y^2)/2$, we obtain $E[X_1^2 X_2^2] \leq E[X_1^4]$, and
$$E\big[(X_1 + \cdots + X_n)^4\big] \leq \big(n + 3n(n - 1)\big)E[X_1^4] \leq 3n^2 E[X_1^4].$$
It follows that
$$E\left[\sum_{n=1}^{\infty} \frac{(X_1 + \cdots + X_n)^4}{n^4}\right] = \sum_{n=1}^{\infty} \frac{1}{n^4} E\big[(X_1 + \cdots + X_n)^4\big] \leq \sum_{n=1}^{\infty} \frac{3}{n^2} E[X_1^4] < \infty,$$
where the last step uses the well known property $\sum_{n=1}^{\infty} n^{-2} < \infty$. This implies that
$(X_1 + \cdots + X_n)^4/n^4$ converges to zero with probability 1 (cf. Problem 16), and therefore,
$(X_1 + \cdots + X_n)/n$ also converges to zero with probability 1, which is the strong law of
large numbers.

For the more general case where the mean of the random variables $X_i$ is nonzero,
the preceding argument establishes that $(X_1 + \cdots + X_n - nE[X_1])/n$ converges to zero,
which is the same as $(X_1 + \cdots + X_n)/n$ converging to $E[X_1]$, with probability 1.
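
The counting step in the solution to Problem 18 can be verified by brute force for small $n$. The Python sketch below (with an illustrative function name) enumerates all index tuples $(i_1, i_2, i_3, i_4)$ and counts those in which no index appears exactly once, which are the only terms that can be nonzero when the $X_i$ are independent with zero mean.

```python
from itertools import product

def count_nonzero_terms(n):
    """Count tuples (i1, i2, i3, i4) with entries in {0, ..., n-1}
    in which no index value appears exactly once."""
    count = 0
    for idx in product(range(n), repeat=4):
        if all(idx.count(i) != 1 for i in idx):
            count += 1
    return count

for n in range(1, 7):
    total = count_nonzero_terms(n)
    # n terms of the form E[X_i^4] plus 3n(n-1) terms of the form E[X_i^2 X_j^2].
    assert total == n + 3 * n * (n - 1)
    # The bound used in the proof.
    assert total <= 3 * n ** 2
    print(n, total)
```

The assertions pass for every $n$ tried, matching the count $n + 3n(n-1)$ and the bound $3n^2$ used in the argument.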


The Bernoulli and Poisson Processes

A stochastic process is a mathematical model of a probabilistic experiment that
evolves in time and generates a sequence of numerical values. For example, a
stochastic process can be used to model:


(a) the sequence of daily prices of a stock;

(b) the sequence of scores in a football game;

(c) the sequence of failure times of a machine;

(d) the sequence of hourly traffic loads at a node of a communication network;

(e) the sequence of radar measurements of the position of an airplane.


Each numerical value in the sequence is modeled by a random variable, so a
stochastic process is simply a (finite or infinite) sequence of random variables
and does not represent a major conceptual departure from our basic framework.
We are still dealing with a single basic experiment that involves outcomes
governed by a probability law, and random variables that inherit their probabilistic
properties from that law.†

However, stochastic processes involve some change in emphasis over our
earlier models. In particular:


(a) We tend to focus on the dependencies in the sequence of values generated
by the process. For example, how do future prices of a stock depend on
past values?


(b) We are often interested in long-term averages involving the entire
sequence of generated values. For example, what is the fraction of time that
a machine is idle?


(c) We sometimes wish to characterize the likelihood or frequency of certain
boundary events. For example, what is the probability that within a
given hour all circuits of some telephone system become simultaneously
busy, or what is the frequency with which some buffer in a computer
network overflows with data?


There is a wide variety of stochastic processes, but in this book we will
only discuss two major categories.


(i) Arrival-Type Processes: Here, we are interested in occurrences that have
the character of an “arrival,” such as message receptions at a receiver, job
completions in a manufacturing cell, customer purchases at a store, etc.
We will focus on models in which the interarrival times (the times between
successive arrivals) are independent random variables. In Section 6.1, we
consider the case where arrivals occur in discrete time and the interarrival
times are geometrically distributed; this is the Bernoulli process. In
Section 6.2, we consider the case where arrivals occur in continuous time and
the interarrival times are exponentially distributed; this is the Poisson
process.


(ii) Markov Processes: Here, we are looking at experiments that evolve in time
and in which the future evolution exhibits a probabilistic dependence on
the past. As an example, the future daily prices of a stock are typically
dependent on past prices. However, in a Markov process, we assume a very
special type of dependence: the next value depends on past values only
through the current value. There is a rich methodology that applies to
such processes, and is the subject of Chapter 7.


† Let us emphasize that all of the random variables arising in a stochastic process
refer to a single and common experiment, and are therefore defined on a common sample
space. The corresponding probability law can be specified explicitly or implicitly (in
terms of its properties), provided that it determines unambiguously the joint CDF of
any subset of the random variables involved.


6.1 THE BERNOULLI PROCESS


The Bernoulli process can be visualized as a sequence of independent coin tosses,
where the probability of heads in each toss is a fixed number $p$ in the range
$0 < p < 1$. In general, the Bernoulli process consists of a sequence of Bernoulli
trials. Each trial produces a 1 (a success) with probability $p$, and a 0 (a failure)
with probability $1 - p$, independent of what happens in other trials.

Of course, coin tossing is just a paradigm for a broad range of contexts 
involving a sequence of independent binary outcomes. For example, a Bernoulli 
process is often used to model systems involving arrivals of customers or jobs at 
service centers. Here, time is discretized into periods, and a "success" at the kth 
trial is associated with the arrival of at least one customer at the service center 
during the kth period. We will often use the term “arrival” in place of “success” 
when this is justified by the context. 

In a more formal description, we define the Bernoulli process as a sequence
$X_1, X_2, \ldots$ of independent Bernoulli random variables $X_i$ with

$$P(X_i = 1) = P(\text{success at the } i\text{th trial}) = p,$$
$$P(X_i = 0) = P(\text{failure at the } i\text{th trial}) = 1 - p,$$

for each $i$.†

† Generalizing from the case of a finite number of random variables, the
independence of an infinite sequence of random variables $X_i$ is defined by the requirement
that the random variables $X_1, \ldots, X_n$ be independent for any finite $n$. Intuitively,
knowing the values of any finite subset of the random variables does not provide any
new probabilistic information on the remaining random variables, and the conditional
distribution of the latter is the same as the unconditional one.

Given an arrival process, one is often interested in random variables such
as the number of arrivals within a certain time period, or the time until the first
arrival. For the case of a Bernoulli process, some answers are already available
from earlier chapters. Here is a summary of the main facts.


Some Random Variables Associated with the Bernoulli Process
and their Properties


• The binomial with parameters $p$ and $n$. This is the number $S$ of
successes in $n$ independent trials. Its PMF, mean, and variance are

$$p_S(k) = \binom{n}{k} p^k (1 - p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

$$E[S] = np, \qquad \mathrm{var}(S) = np(1 - p).$$


• The geometric with parameter $p$. This is the number $T$ of trials
up to (and including) the first success. Its PMF, mean, and variance
are

$$p_T(t) = (1 - p)^{t-1} p, \qquad t = 1, 2, \ldots,$$

$$E[T] = \frac{1}{p}, \qquad \mathrm{var}(T) = \frac{1 - p}{p^2}.$$
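
The summary formulas for the binomial and geometric random variables can be checked directly from their PMFs. The Python sketch below is a minimal illustration; the parameter values $n = 10$, $p = 0.3$ and the truncation point of the geometric sums are arbitrary choices, and the function names are not from the text.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(S = k) for the number S of successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(t, p):
    """P(T = t) for the number T of trials up to and including the first success."""
    return (1 - p) ** (t - 1) * p

n, p = 10, 0.3

mean_s = sum(k * binomial_pmf(k, n, p) for k in range(n + 1))
var_s = sum((k - mean_s) ** 2 * binomial_pmf(k, n, p) for k in range(n + 1))
print(mean_s, n * p)            # compare with E[S] = np
print(var_s, n * p * (1 - p))   # compare with var(S) = np(1 - p)

# The geometric sums are infinite; truncating at 500 terms leaves a negligible tail.
horizon = 500
mean_t = sum(t * geometric_pmf(t, p) for t in range(1, horizon))
var_t = sum((t - mean_t) ** 2 * geometric_pmf(t, p) for t in range(1, horizon))
print(mean_t, 1 / p)            # compare with E[T] = 1/p
print(var_t, (1 - p) / p ** 2)  # compare with var(T) = (1 - p)/p^2
```

Each printed pair should agree up to floating-point rounding.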





Independence and Memorylessness 


The independence assumption underlying the Bernoulli process has important
implications, including a memorylessness property (whatever has happened in
past trials provides no information on the outcomes of future trials). An
appreciation and intuitive understanding of such properties is very useful, and allows
the quick solution of many problems that would be difficult with a more formal
approach. In this subsection, we aim at developing the necessary intuition.

Let us start by considering random variables that are defined in terms of
what happened in a certain set of trials. For example, the random variable
$Z = (X_1 + X_3) X_6 X_7$ is defined in terms of the first, third, sixth, and seventh
trial. If we have two random variables of this type and if the two sets of trials
that define them have no common element, then these random variables are
independent. This is a generalization of a fact first seen in Chapter 2: if two
random variables $U$ and $V$ are independent, then any two functions of them,
$g(U)$ and $h(V)$, are also independent.


Example 6.1.


(a) Let $U$ be the number of successes in trials 1 to 5. Let $V$ be the number of
successes in trials 6 to 10. Then, $U$ and $V$ are independent. This is because
$U = X_1 + \cdots + X_5$, $V = X_6 + \cdots + X_{10}$, and the two collections $\{X_1, \ldots, X_5\}$,
$\{X_6, \ldots, X_{10}\}$ have no common elements.


(b) Let $U$ (respectively, $V$) be the first odd (respectively, even) time $i$ in which we
have a success. Then, $U$ is determined by the odd-time sequence $X_1, X_3, \ldots$,
whereas $V$ is determined by the even-time sequence $X_2, X_4, \ldots$. Since these
two sequences have no common elements, $U$ and $V$ are independent.


Suppose now that a Bernoulli process has been running for $n$ time steps,
and that we have observed the values of $X_1, X_2, \ldots, X_n$. We notice that the
sequence of future trials $X_{n+1}, X_{n+2}, \ldots$ are independent Bernoulli trials and
therefore form a Bernoulli process. In addition, these future trials are
independent from the past ones. We conclude that starting from any given point in
time, the future is also modeled by a Bernoulli process, which is independent of
the past. We refer to this loosely as the fresh-start property of the Bernoulli
process.

Let us now recall that the time $T$ until the first success is a geometric
random variable. Suppose that we have been watching the process for $n$ time
steps and no success has been recorded. What can we say about the number $T - n$
of remaining trials until the first success? Since the future of the process (after
time $n$) is independent of the past and constitutes a fresh-starting Bernoulli
process, the number of future trials until the first success is described by the
same geometric PMF. Mathematically, we have

$$P(T - n = t \mid T > n) = (1 - p)^{t-1} p = P(T = t), \qquad t = 1, 2, \ldots.$$

We refer to this as the memorylessness property. It can also be derived
algebraically, using the definition of conditional probabilities, but the argument
given here is certainly more intuitive.
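
The algebraic route mentioned above is short enough to check numerically. The Python sketch below computes $P(T - n = t \mid T > n)$ from the definition of conditional probability, using $P(T > n) = (1 - p)^n$ (the first $n$ trials all fail), and compares it with the unconditional geometric PMF. The values $p = 0.2$ and $n = 7$ and the function names are illustrative.

```python
def geometric_pmf(t, p):
    """P(T = t) = (1 - p)^(t - 1) * p, for t = 1, 2, ..."""
    return (1 - p) ** (t - 1) * p

def conditional_pmf(t, n, p):
    """P(T - n = t | T > n), from the definition of conditional probability."""
    joint = geometric_pmf(n + t, p)  # P(T = n + t); this event implies T > n when t >= 1
    tail = (1 - p) ** n              # P(T > n): the first n trials are all failures
    return joint / tail

p, n = 0.2, 7
for t in range(1, 6):
    print(t, conditional_pmf(t, n, p), geometric_pmf(t, p))
```

The two columns should coincide up to rounding, whatever the value of $n$.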


Independence Properties of the Bernoulli Process


• For any given time $n$, the sequence of random variables $X_{n+1}, X_{n+2}, \ldots$
(the future of the process) is also a Bernoulli process, and is
independent from $X_1, \ldots, X_n$ (the past of the process).


• Let $n$ be a given time and let $T$ be the time of the first success after
time $n$. Then, $T - n$ has a geometric distribution with parameter $p$,
and is independent of the random variables $X_1, \ldots, X_n$.








Example 6.2. A computer executes two types of tasks, priority and nonpriority, 
and operates in discrete time units (slots). A priority task arrives with probability 
p at the beginning of each slot, independent of other slots, and requires one full slot 
to complete. A nonpriority task is always available and is executed at a given slot 
if no priority task is available. In this context, it may be important to know the 
probabilistic properties of the time intervals available for nonpriority tasks. 

With this in mind, let us call a slot busy if within this slot, the computer 
executes a priority task, and otherwise let us call it idle. We call a string of idle 
(or busy) slots, flanked by busy (or idle, respectively) slots, an idle period (or busy
period, respectively). Let us derive the PMF, mean, and variance of the following
random variables (cf. Fig. 6.1):

(a) T = the time index of the first idle slot;
(b) B = the length (number of slots) of the first busy period;
(c) I = the length of the first idle period;
(d) Z = the number of slots after the first slot of the first busy period up to and
including the first subsequent idle slot.








Figure 6.1: Illustration of random variables, and busy and idle periods in
Example 6.2. In the top diagram, T = 4, B = 3, I = 2, and Z = 3. In the
bottom diagram, T = 1, I = 5, B = 4, and Z = 4.


We recognize T as a geometrically distributed random variable with param- 
eter 1 — p. Its PMF is 


$$p_T(k) = p^{k-1}(1-p), \qquad k = 1, 2, \ldots$$

Its mean and variance are

$$E[T] = \frac{1}{1-p}, \qquad \mathrm{var}(T) = \frac{p}{(1-p)^2}.$$


Let us now consider the first busy period. It starts with the first busy slot, 
call it slot L. (In the top diagram in Fig. 6.1, L = 1; in the bottom diagram, L = 6.)
The number Z of subsequent slots until (and including) the first subsequent idle 
slot has the same distribution as T, because the Bernoulli process starts fresh at 
time L +1. We then notice that Z = B and conclude that B has the same PMF 
as T. 

If we reverse the roles of idle and busy slots, and interchange p with 1 - p, we
see that the length I of the first idle period has the same PMF as the time index
of the first busy slot, so that

$$p_I(k) = (1-p)^{k-1} p, \quad k = 1, 2, \ldots, \qquad E[I] = \frac{1}{p}, \qquad \mathrm{var}(I) = \frac{1-p}{p^2}.$$

We finally note that the argument given here also works for the second, third,
etc., busy (or idle) period. Thus, the PMFs calculated above apply to the ith busy
and idle period, for any i.





If we start watching a Bernoulli process at a certain time n, what we see 
is indistinguishable from a Bernoulli process that has just started. It turns out 
that the same is true if we start watching the process at some random time N, 
as long as N is determined only by the past history of the process and does 
not convey any information on the future. Indeed, such a property was used in 
Example 6.2, when we stated that the process starts fresh at time L + 1. For 
another example, consider a roulette wheel with each occurrence of red viewed 
as a success. The sequence generated starting at some fixed spin (say, the 25th 
spin) is probabilistically indistinguishable from the sequence generated starting 
immediately after red occurs in five consecutive spins. In either case, the process 
starts fresh (although one can certainly find gamblers with alternative theories). 
The next example makes a similar argument, but more formally. 


Example 6.3. Fresh-Start at a Random Time. Let N be the first time that
we have a success immediately following a previous success. (That is, N is the first
i for which $X_{i-1} = X_i = 1$.) What is the probability $P(X_{N+1} = X_{N+2} = 0)$ that
there are no successes in the two trials that follow?

Intuitively, once the condition $X_{N-1} = X_N = 1$ is satisfied, and from then
on, the future of the process consists of independent Bernoulli trials. Therefore, the
probability of an event that refers to the future of the process is the same as in a
fresh-starting Bernoulli process, so that $P(X_{N+1} = X_{N+2} = 0) = (1-p)^2$.

To provide a rigorous justification of the above argument, we note that the 
time N is a random variable, and by conditioning on the possible values of N, we 
have 


$$\begin{aligned}
P(X_{N+1} = X_{N+2} = 0) &= \sum_{n=1}^{\infty} P(N = n)\, P(X_{N+1} = X_{N+2} = 0 \mid N = n) \\
&= \sum_{n=1}^{\infty} P(N = n)\, P(X_{n+1} = X_{n+2} = 0 \mid N = n).
\end{aligned}$$

Because of the way that N was defined, the event {N = n} occurs if and only if
the values of $X_1, \ldots, X_n$ satisfy a certain condition. But these random variables
are independent of $X_{n+1}$ and $X_{n+2}$. Therefore,

$$P(X_{n+1} = X_{n+2} = 0 \mid N = n) = P(X_{n+1} = X_{n+2} = 0) = (1-p)^2,$$

which leads to

$$P(X_{N+1} = X_{N+2} = 0) = \sum_{n=1}^{\infty} P(N = n)\,(1-p)^2 = (1-p)^2.$$
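
The conditioning argument of Example 6.3 can be illustrated by brute-force enumeration. The sketch below is an illustration under stated assumptions, not the book's method: it truncates the process to m trials (m = 10 and p = 1/3 are arbitrary choices) and conditions on the event that N occurs early enough for both following trials to lie inside the truncated window. The function name `fresh_start_check` is hypothetical.

```python
from fractions import Fraction
from itertools import product

def fresh_start_check(p, m=12):
    """Enumerate all length-m outcomes; return P(X_{N+1} = X_{N+2} = 0 | N <= m - 2)."""
    prob_N_found = Fraction(0)      # P(N <= m - 2)
    prob_both_zero = Fraction(0)    # P(N <= m - 2 and X_{N+1} = X_{N+2} = 0)
    for x in product((0, 1), repeat=m):
        # Probability of this particular outcome sequence.
        weight = Fraction(1)
        for xi in x:
            weight *= p if xi == 1 else 1 - p
        # N = first 1-based index i with X_{i-1} = X_i = 1 (x itself is 0-based).
        N = next((i + 1 for i in range(1, m) if x[i - 1] == 1 and x[i] == 1), None)
        if N is not None and N <= m - 2:
            prob_N_found += weight
            # 0-based positions N and N + 1 hold X_{N+1} and X_{N+2}.
            if x[N] == 0 and x[N + 1] == 0:
                prob_both_zero += weight
    return prob_both_zero / prob_N_found

p = Fraction(1, 3)                                   # illustrative value (assumption)
print(fresh_start_check(p, m=10) == (1 - p) ** 2)
```

Because the event {N = n} depends only on the first n trials, the ratio equals (1 - p)^2 exactly for every truncation length, which mirrors the term-by-term factorization in the sum above.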


Interarrival Times


An important random variable associated with the Bernoulli process is the time
of the kth success (or arrival), which we denote by $Y_k$. A related random variable
is the kth interarrival time, denoted by $T_k$. It is defined by

$$T_1 = Y_1, \qquad T_k = Y_k - Y_{k-1}, \quad k = 2, 3, \ldots,$$

and represents the number of trials following the (k - 1)st success until the next
success. See Fig. 6.2 for an illustration, and also note that

$$Y_k = T_1 + T_2 + \cdots + T_k.$$


Figure 6.2: Illustration of interarrival times, where a 1 represents an arrival. In
this example, $T_1 = 3$, $T_2 = 5$, $T_3 = 2$, $T_4 = 1$. Furthermore, $Y_1 = 3$, $Y_2 = 8$,
$Y_3 = 10$, $Y_4 = 11$.


We have already seen that the time $T_1$ until the first success is a geometric
random variable with parameter p. Having had a success at time $T_1$, the future is
a new Bernoulli process, similar to the original: the number of trials $T_2$ until the
next success has the same geometric PMF. Furthermore, past trials (up to and
including time $T_1$) are independent of future trials (from time $T_1 + 1$ onward).
Since $T_2$ is determined exclusively by what happens in these future trials, we
see that $T_2$ is independent of $T_1$. Continuing similarly, we conclude that the
random variables $T_1, T_2, T_3, \ldots$ are independent and all have the same geometric
distribution.

This important observation leads to an alternative, but equivalent, way of
describing the Bernoulli process, which is sometimes more convenient.


Alternative Description of the Bernoulli Process 


1. Start with a sequence of independent geometric random variables $T_1$,
$T_2, \ldots$, with common parameter p, and let these stand for the interarrival
times.


2. Record a success (or arrival) at times $T_1$, $T_1 + T_2$, $T_1 + T_2 + T_3$, etc.


Example 6.4. It has been observed that after a rainy day, the number of days
until it rains again is geometrically distributed with parameter p, independent of
the past. Find the probability that it rains on both the 5th and the 8th day of the
month.

If we attempt to approach this problem by manipulating the geometric PMFs
in the problem statement, the solution is quite tedious. However, if we view rainy
days as "arrivals," we notice that the description of the weather conforms to the
alternative description of the Bernoulli process given above. Therefore, any given day
is rainy with probability p, independent of other days. In particular, the probability
that days 5 and 8 are rainy is equal to $p^2$.
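
The "tedious" route through the geometric PMFs can be delegated to a short computation. The sketch below is an illustration, not the book's solution: it assumes that day 0 is treated as a renewal point (so the first interarrival time is also geometric), uses the arbitrary value p = 1/4, and relies on a standard renewal recursion obtained by conditioning on the time of the first arrival.

```python
from fractions import Fraction

def arrival_probabilities(p, horizon):
    """u[n] = P(arrival at time n) when interarrival times are i.i.d. geometric(p).

    Renewal recursion: u[n] = sum_{k=1}^{n} P(T = k) * u[n - k], with u[0] = 1,
    obtained by conditioning on the time k of the first arrival.
    """
    u = [Fraction(1)]
    for n in range(1, horizon + 1):
        u.append(sum((1 - p) ** (k - 1) * p * u[n - k] for k in range(1, n + 1)))
    return u

p = Fraction(1, 4)                      # illustrative value (assumption)
u = arrival_probabilities(p, 8)

# Rain on day 5; the process then restarts, and rain must occur again 3 days later.
prob_5_and_8 = u[5] * u[3]
print(u[1:], prob_5_and_8)
```

Every u[n] for n ≥ 1 comes out equal to p, which is the statement that geometric interarrival times reproduce a Bernoulli process, and the product equals p^2.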


The kth Arrival Time 


The time $Y_k$ of the kth success (or arrival) is equal to the sum
$Y_k = T_1 + T_2 + \cdots + T_k$ of k independent identically distributed geometric random variables.
This allows us to derive formulas for the mean, variance, and PMF of $Y_k$, which
are given in the table that follows.


Properties of the kth Arrival Time 
• The kth arrival time is equal to the sum of the first k interarrival times

$$Y_k = T_1 + T_2 + \cdots + T_k,$$

and the latter are independent geometric random variables with common
parameter p.


• The mean and variance of $Y_k$ are given by

$$E[Y_k] = E[T_1] + \cdots + E[T_k] = \frac{k}{p},$$

$$\mathrm{var}(Y_k) = \mathrm{var}(T_1) + \cdots + \mathrm{var}(T_k) = \frac{k(1-p)}{p^2}.$$


• The PMF of $Y_k$ is given by

$$p_{Y_k}(t) = \binom{t-1}{k-1} p^k (1-p)^{t-k}, \qquad t = k, k+1, \ldots,$$

and is known as the Pascal PMF of order k.





To verify the formula for the PMF of $Y_k$, we first note that $Y_k$ cannot be
smaller than k. For $t \geq k$, we observe that the event $\{Y_k = t\}$ (the kth success
comes at time t) will occur if and only if both of the following two events A and
B occur:


(a) event A: trial t is a success;
(b) event B: exactly k - 1 successes occur in the first t - 1 trials.


The probabilities of these two events are

$$P(A) = p$$

and

$$P(B) = \binom{t-1}{k-1} p^{k-1} (1-p)^{t-k},$$

respectively. In addition, these two events are independent (whether trial t is a
success or not is independent of what happened in the first t - 1 trials). Therefore,

$$p_{Y_k}(t) = P(Y_k = t) = P(A \cap B) = P(A)P(B) = \binom{t-1}{k-1} p^k (1-p)^{t-k},$$

as claimed.
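
As a cross-check of the Pascal formula, one can also build the PMF of $Y_k$ directly as a k-fold convolution of geometric PMFs and compare. This is a minimal sketch under assumed illustrative values (p = 2/5, k = 3); the function names are hypothetical and the arithmetic is exact.

```python
from fractions import Fraction
from math import comb

def pascal_pmf(t, k, p):
    """P(Y_k = t) = C(t-1, k-1) p^k (1-p)^(t-k) for t >= k, and 0 otherwise."""
    if t < k:
        return Fraction(0)
    return comb(t - 1, k - 1) * p ** k * (1 - p) ** (t - k)

def kth_arrival_pmf_by_convolution(k, p, t_max):
    """PMF of T_1 + ... + T_k on 0..t_max, by repeated convolution of geometric PMFs."""
    geom = [Fraction(0)] + [(1 - p) ** (t - 1) * p for t in range(1, t_max + 1)]
    pmf = [Fraction(1)] + [Fraction(0)] * t_max     # PMF of the empty sum (always 0)
    for _ in range(k):
        pmf = [sum(pmf[s] * geom[t - s] for s in range(t + 1)) for t in range(t_max + 1)]
    return pmf

p, k, t_max = Fraction(2, 5), 3, 12                 # illustrative values (assumptions)
by_convolution = kth_arrival_pmf_by_convolution(k, p, t_max)
by_formula = [pascal_pmf(t, k, p) for t in range(t_max + 1)]
print(by_convolution == by_formula)
```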


Example 6.5. In each minute of basketball play, Alicia commits a single foul with 
probability p and no foul with probability 1 — p. The number of fouls in different 
minutes are assumed to be independent. Alicia will foul out of the game once she 
commits her sixth foul, and will play 30 minutes if she does not foul out. What is
the PMF of Alicia's playing time?

We model fouls as a Bernoulli process with parameter p. Alicia's playing time
Z is equal to $Y_6$, the time until the sixth foul, except if $Y_6$ is larger than 30, in which
case her playing time is 30; that is, $Z = \min\{Y_6, 30\}$. The random variable $Y_6$ has
a Pascal PMF of order 6, which is given by

$$p_{Y_6}(t) = \binom{t-1}{5} p^6 (1-p)^{t-6}, \qquad t = 6, 7, \ldots$$

To determine the PMF $p_Z(z)$ of Z, we first consider the case where z is between 6
and 29. For z in this range, we have

$$p_Z(z) = P(Z = z) = P(Y_6 = z) = \binom{z-1}{5} p^6 (1-p)^{z-6}, \qquad z = 6, 7, \ldots, 29.$$

The probability that Z = 30 is then determined from

$$p_Z(30) = 1 - \sum_{z=6}^{29} p_Z(z).$$
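
For a concrete feel, the PMF of Example 6.5 can be tabulated numerically. The example leaves p unspecified, so the value p = 0.2 below is purely an illustrative assumption, as is the function name `playing_time_pmf`.

```python
from math import comb

def playing_time_pmf(p, fouls_allowed=6, max_minutes=30):
    """PMF of Z = min(Y_6, 30), returned as a dict {z: P(Z = z)}."""
    k = fouls_allowed
    # Pascal PMF of order k for z = 6, ..., 29.
    pmf = {z: comb(z - 1, k - 1) * p ** k * (1 - p) ** (z - k)
           for z in range(k, max_minutes)}
    # All remaining probability mass sits at z = 30.
    pmf[max_minutes] = 1.0 - sum(pmf.values())
    return pmf

pmf = playing_time_pmf(p=0.2)   # p = 0.2 is an illustrative value, not from the text
expected_minutes = sum(z * prob for z, prob in pmf.items())
print(pmf[30], expected_minutes)
```

The mass at 30 is the probability that Alicia has committed at most five fouls in her first 29 minutes, which gives an independent way to sanity-check the last entry.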


Splitting and Merging of Bernoulli Processes


Starting with a Bernoulli process in which there is a probability p of an arrival
at each time, consider splitting it as follows. Whenever there is an arrival, we
choose to either keep it (with probability q), or to discard it (with probability
1 - q); see Fig. 6.3. Assume that the decisions to keep or discard are independent
for different arrivals. If we focus on the process of arrivals that are kept, we see
that it is a Bernoulli process: in each time slot, there is a probability pq of a
kept arrival, independent of what happens in other slots. For the same reason,
the process of discarded arrivals is also a Bernoulli process, with a probability
of a discarded arrival at each time slot equal to p(1 - q).





Figure 6.3: Splitting of a Bernoulli process.


In a reverse situation, we start with two independent Bernoulli processes
(with parameters p and q, respectively) and merge them into a single process,
as follows. An arrival is recorded in the merged process if and only if there
is an arrival in at least one of the two original processes. This happens with
probability p + q - pq [one minus the probability (1 - p)(1 - q) of no arrival in
either process]. Since different time slots in either of the original processes are
independent, different slots in the merged process are also independent. Thus,
the merged process is Bernoulli, with success probability p + q - pq at each time
step; see Fig. 6.4.

Splitting and merging of Bernoulli (or other) arrival processes arise in
many contexts. For example, a two-machine work center may see a stream of
arriving parts to be processed and split them by sending each part to a randomly
chosen machine. Conversely, a machine may be faced with arrivals of different
types that can be merged into a single arrival stream.


Figure 6.4: Merging of independent Bernoulli processes.


The Poisson Approximation to the Binomial


The number of successes in n independent Bernoulli trials is a binomial random
variable with parameters n and p, and its mean is np. In this subsection, we
concentrate on the special case where n is large but p is small, so that the mean
np has a moderate value. A situation of this type arises when one passes from
discrete to continuous time, a theme to be picked up in the next section. For
some examples, think of the number of airplane accidents on any given day:
there is a large number n of trials (airplane flights), but each one has a very
small probability p of being involved in an accident. Or think of counting the
number of typos in a book: there is a large number of words, but a very small
probability of misspelling any single one.

Mathematically, we can address situations of this kind by letting n grow
while simultaneously decreasing p, in a manner that keeps the product np at a
constant value $\lambda$. In the limit, it turns out that the formula for the binomial PMF
simplifies to the Poisson PMF. A precise statement is provided next, together
with a reminder of some of the properties of the Poisson PMF that were derived
in Chapter 2.





Poisson Approximation to the Binomial 


• A Poisson random variable Z with parameter $\lambda$ takes nonnegative
integer values and is described by the PMF

$$p_Z(k) = e^{-\lambda} \frac{\lambda^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

Its mean and variance are given by

$$E[Z] = \lambda, \qquad \mathrm{var}(Z) = \lambda.$$


• For any fixed nonnegative integer k, the binomial probability

$$p_S(k) = \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}$$

converges to $p_Z(k)$, when we take the limit as $n \to \infty$ and $p = \lambda/n$,
while keeping $\lambda$ constant.


• In general, the Poisson PMF is a good approximation to the binomial
as long as $\lambda = np$, n is very large, and p is very small.





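To verify the validity of the Poisson approximation, we let $p = \lambda/n$ and
note that

$$\begin{aligned}
p_S(k) &= \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k} \\
&= \frac{n(n-1)\cdots(n-k+1)}{k!} \cdot \frac{\lambda^k}{n^k} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k} \\
&= \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \cdot \frac{\lambda^k}{k!} \cdot \left(1 - \frac{\lambda}{n}\right)^{n-k}.
\end{aligned}$$

Let us focus on a fixed k and let $n \to \infty$. Each one of the ratios (n - 1)/n,
(n - 2)/n, ..., (n - k + 1)/n converges to 1. Furthermore,†

$$\left(1 - \frac{\lambda}{n}\right)^{-k} \to 1, \qquad \left(1 - \frac{\lambda}{n}\right)^{n} \to e^{-\lambda}.$$

We conclude that for each fixed k, and as $n \to \infty$, we have

$$p_S(k) \to e^{-\lambda} \frac{\lambda^k}{k!}.$$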

† We are using here the well-known formula $\lim_{x \to \infty} (1 - \frac{1}{x})^x = e^{-1}$. Letting
$x = n/\lambda$, we have $\lim_{n \to \infty} (1 - \frac{\lambda}{n})^{n/\lambda} = e^{-1}$, and $\lim_{n \to \infty} (1 - \frac{\lambda}{n})^n = e^{-\lambda}$.


Example 6.6. As a rule of thumb, the Poisson/binomial approximation

$$e^{-\lambda} \frac{\lambda^k}{k!} \approx \frac{n!}{(n-k)!\,k!}\, p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n,$$

is valid to several decimal places if $n \geq 100$, $p \leq 0.01$, and $\lambda = np$. To check this,
consider the following.

Gary Kasparov, a world chess champion, plays against 100 amateurs in a large
simultaneous exhibition. It has been estimated from past experience that Kasparov
wins in such exhibitions 99% of his games on the average (in precise probabilistic
terms, we assume that he wins each game with probability 0.99, independent of
other games). What are the probabilities that he will win 100 games, 98 games, 95
games, and 90 games?

We model the number of games X that Kasparov does not win as a binomial
random variable with parameters n = 100 and p = 0.01. Thus the probabilities
that he will win 100 games, 98 games, 95 games, and 90 games are

$$p_X(0) = (1 - 0.01)^{100} = 0.366,$$

$$p_X(2) = \frac{100!}{98!\,2!}\, 0.01^2 (1 - 0.01)^{98} = 0.185,$$

$$p_X(5) = \frac{100!}{95!\,5!}\, 0.01^5 (1 - 0.01)^{95} = 0.00290,$$

$$p_X(10) = \frac{100!}{90!\,10!}\, 0.01^{10} (1 - 0.01)^{90} = 7.006 \cdot 10^{-8},$$

respectively. Now let us check the corresponding Poisson approximations with
$\lambda = 100 \cdot 0.01 = 1$. They are:

$$p_Z(0) = e^{-1} \frac{1}{0!} = 0.368,$$

$$p_Z(2) = e^{-1} \frac{1}{2!} = 0.184,$$

$$p_Z(5) = e^{-1} \frac{1}{5!} = 0.00306,$$

$$p_Z(10) = e^{-1} \frac{1}{10!} = 1.014 \cdot 10^{-7}.$$

By comparing the binomial PMF values $p_X(k)$ with their Poisson approximations
$p_Z(k)$, we see that there is close agreement.

Suppose now that Kasparov plays simultaneously against just 5 opponents,
who are, however, stronger so that his probability of a win per game is 0.9. Here
are the binomial probabilities $p_X(k)$ for n = 5 and p = 0.1, and the corresponding
Poisson approximations $p_Z(k)$ for $\lambda = np = 0.5$:

k          0        1        2        3        4         5
p_Z(k)     0.607    0.303    0.0758   0.0126   0.0016    0.00016
p_X(k)     0.590    0.328    0.0729   0.0081   0.00045   0.00001

We see that the approximation, while not poor, is considerably less accurate than
in the case where n = 100 and p = 0.01.
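
The comparison in Example 6.6 is easy to reproduce. The sketch below is an illustration only: it evaluates both PMFs for the two cases discussed and reports the largest absolute discrepancy over all k, a summary measure chosen here for convenience rather than taken from the text.

```python
from math import comb, exp, factorial

def binomial_pmf(k, n, p):
    """Binomial probability of k successes in n trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(k, lam):
    """Poisson probability of the value k with parameter lam."""
    return exp(-lam) * lam ** k / factorial(k)

def max_abs_error(n, p):
    """Largest |binomial - Poisson| over k = 0, ..., n, with lam = n * p."""
    lam = n * p
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam)) for k in range(n + 1))

for n, p in [(100, 0.01), (5, 0.1)]:
    print(n, p, max_abs_error(n, p))
```

The discrepancy for n = 5, p = 0.1 is more than ten times larger than for n = 100, p = 0.01, in line with the conclusion of the example.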


Example 6.7. A packet consisting of a string of n symbols is transmitted over
a noisy channel. Each symbol has probability p = 0.0001 of being transmitted in
error, independent of errors in the other symbols. How small should n be in order
for the probability of incorrect transmission (at least one symbol in error) to be less
than 0.001?

Each symbol transmission is viewed as an independent Bernoulli trial. Thus,
the probability of a positive number S of errors in the packet is

$$1 - P(S = 0) = 1 - (1-p)^n.$$

For this probability to be less than 0.001, we must have $1 - (1 - 0.0001)^n < 0.001$,
or

$$n < \frac{\ln 0.999}{\ln 0.9999} = 10.0045.$$

We can also use the Poisson approximation for P(S = 0), which is $e^{-\lambda}$ with
$\lambda = np = 0.0001 \cdot n$, and obtain the condition $1 - e^{-0.0001 \cdot n} < 0.001$, which leads to

$$n < \frac{-\ln 0.999}{0.0001} = 10.005.$$

Given that n must be an integer, both methods lead to the same conclusion that n
can be at most 10.
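
The two bounds in Example 6.7 can be evaluated directly; the following sketch simply carries out the arithmetic with the values given in the example (the variable names are illustrative).

```python
from math import floor, log

p = 0.0001          # per-symbol error probability (from the example)
target = 0.001      # required bound on the packet error probability

n_exact_bound = log(1 - target) / log(1 - p)     # from 1 - (1 - p)^n < target
n_poisson_bound = -log(1 - target) / p           # from 1 - e^{-n p} < target
largest_n = floor(n_exact_bound)

print(n_exact_bound, n_poisson_bound, largest_n)
```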


6.2 THE POISSON PROCESS 


The Poisson process is a continuous-time analog of the Bernoulli process and 
applies to situations where there is no natural way of dividing time into discrete 
periods. 

To see the need for a continuous-time version of the Bernoulli process, let
us consider a possible model of traffic accidents within a city. We can start by
discretizing time into one-minute periods and record a "success" during every
minute in which there is at least one traffic accident. Assuming the traffic
intensity to be constant over time, the probability of an accident should be the
same during each period. Under the additional (and quite plausible) assumption
that different time periods are independent, the sequence of successes becomes a
Bernoulli process. Note that in real life, two or more accidents during the same
one-minute interval are certainly possible, but the Bernoulli process model does
not keep track of the exact number of accidents. In particular, it does not allow
us to calculate the expected number of accidents within a given period.

One way around this difficulty is to choose the length of a time period to be
very small, so that the probability of two or more accidents becomes negligible.
But how small should it be? A second? A millisecond? Instead of making an
arbitrary choice, it is preferable to consider a limiting situation where the length
of the time period becomes zero, and work with a continuous-time model.

We consider an arrival process that evolves in continuous time, in the sense
that any real number t is a possible arrival time. We define

$$P(k, \tau) = P(\text{there are exactly } k \text{ arrivals during an interval of length } \tau),$$

and assume that this probability is the same for all intervals of the same length $\tau$.
We also introduce a positive parameter $\lambda$, called the arrival rate or intensity
of the process, for reasons that will soon become apparent.


Definition of the Poisson Process 


An arrival process is called a Poisson process with rate $\lambda$ if it has the
following properties:


(a) (Time-homogeneity) The probability $P(k, \tau)$ of k arrivals is the
same for all intervals of the same length $\tau$.


(b) (Independence) The number of arrivals during a particular interval
is independent of the history of arrivals outside this interval.


(c) (Small interval probabilities) The probabilities $P(k, \tau)$ satisfy

$$P(0, \tau) = 1 - \lambda\tau + o(\tau),$$
$$P(1, \tau) = \lambda\tau + o_1(\tau),$$
$$P(k, \tau) = o_k(\tau), \qquad \text{for } k = 2, 3, \ldots$$

Here, $o(\tau)$ and $o_k(\tau)$ are functions of $\tau$ that satisfy

$$\lim_{\tau \to 0} \frac{o(\tau)}{\tau} = 0, \qquad \lim_{\tau \to 0} \frac{o_k(\tau)}{\tau} = 0.$$





The first property states that arrivals are "equally likely" at all times. The
arrivals during any time interval of length $\tau$ are statistically the same, i.e., they
obey the same probability law. This is a counterpart to the assumption that the
success probability p in a Bernoulli process is the same for all trials.

To interpret the second property, consider a particular interval $[t, t']$, of
length $t' - t$. The unconditional probability of k arrivals during that interval
is $P(k, t' - t)$. Suppose now that we are given complete or partial information
on the arrivals outside this interval. Property (b) states that this information
is irrelevant: the conditional probability of k arrivals during $[t, t']$ remains equal
to the unconditional probability $P(k, t' - t)$. This property is analogous to the
independence of trials in a Bernoulli process.

The third property is critical. The $o(\tau)$ and $o_k(\tau)$ terms are meant to be
negligible in comparison to $\tau$, when the interval length $\tau$ is very small. They can
be thought of as the $O(\tau^2)$ terms in a Taylor series expansion of $P(k, \tau)$. Thus,
for small $\tau$, the probability of a single arrival is roughly $\lambda\tau$, plus a negligible
term. Similarly, for small $\tau$, the probability of zero arrivals is roughly $1 - \lambda\tau$.
Finally, the probability of two or more arrivals is negligible in comparison to
$P(1, \tau)$, as $\tau$ becomes smaller.


Number of Arrivals in an Interval


We will now derive some probability distributions associated with the arrivals
in a Poisson process. We first use the connection with the Bernoulli process to
obtain the PMF of the number of arrivals in a given time interval.

Let us consider a fixed time interval of length $\tau$, and partition it into $\tau/\delta$
periods of length $\delta$, where $\delta$ is a very small number; see Fig. 6.5. The probability
of two or more arrivals during any period can be neglected, because of property
(c) and the preceding discussion. Different periods are independent, by property
(b). Furthermore, each period has one arrival with probability approximately
equal to $\lambda\delta$, or zero arrivals with probability approximately equal to $1 - \lambda\delta$.
Therefore, the process being studied can be approximated by a Bernoulli process,
with the approximation becoming more and more accurate as $\delta$ becomes smaller.


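[Figure 6.5 diagram: the time axis from 0 to $\tau$ is divided into periods of length $\delta$, with arrivals marked. Number of periods: $n = \tau/\delta$. Probability of success per period: $p = \lambda\delta$. Expected number of arrivals: $np = \lambda\tau$.]

Figure 6.5: Bernoulli approximation of the Poisson process over an interval of
length $\tau$.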


The probability $P(k, \tau)$ of k arrivals in time $\tau$ is approximately the same as
the (binomial) probability of k successes in $n = \tau/\delta$ independent Bernoulli trials
with success probability $p = \lambda\delta$ at each trial. While keeping the length $\tau$ of the
interval fixed, we let the period length $\delta$ decrease to zero. We then note that the
number n of periods goes to infinity, while the product np remains constant and
equal to $\lambda\tau$. Under these circumstances, we saw in the previous section that the
binomial PMF converges to a Poisson PMF with parameter $\lambda\tau$. We are then led
to the important conclusion that

$$P(k, \tau) = e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!}, \qquad k = 0, 1, \ldots$$

Note that a Taylor series expansion of $e^{-\lambda\tau}$ yields

$$P(0, \tau) = e^{-\lambda\tau} = 1 - \lambda\tau + o(\tau),$$
$$P(1, \tau) = \lambda\tau e^{-\lambda\tau} = \lambda\tau - \lambda^2\tau^2 + O(\tau^3) = \lambda\tau + o_1(\tau),$$

consistent with property (c).

Using our earlier formulas for the mean and variance of the Poisson PMF,
we obtain

$$E[N_\tau] = \lambda\tau, \qquad \mathrm{var}(N_\tau) = \lambda\tau,$$

where $N_\tau$ denotes the number of arrivals during a time interval of length $\tau$.
These formulas are hardly surprising, since we are dealing with the limit of a
binomial PMF with parameters $n = \tau/\delta$, $p = \lambda\delta$, mean $np = \lambda\tau$, and variance
$np(1 - p) \approx np = \lambda\tau$.

Let us now derive the probability law for the time T of the first arrival,
assuming that the process starts at time zero. Note that we have T > t if and
only if there are no arrivals during the interval [0, t]. Therefore,

$$F_T(t) = P(T \leq t) = 1 - P(T > t) = 1 - P(0, t) = 1 - e^{-\lambda t}, \qquad t \geq 0.$$

We then differentiate the CDF $F_T(t)$ of T, and obtain the PDF formula

$$f_T(t) = \lambda e^{-\lambda t}, \qquad t \geq 0,$$

which shows that the time until the first arrival is exponentially distributed with
parameter $\lambda$. We summarize this discussion in the table that follows. See also
Fig. 6.6.


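Random Variables Associated with the Poisson Process and their
Properties


• The Poisson with parameter $\lambda\tau$. This is the number $N_\tau$ of arrivals
in a Poisson process with rate $\lambda$, over an interval of length $\tau$. Its PMF,
mean, and variance are

$$p_{N_\tau}(k) = P(k, \tau) = e^{-\lambda\tau} \frac{(\lambda\tau)^k}{k!}, \qquad k = 0, 1, \ldots,$$

$$E[N_\tau] = \lambda\tau, \qquad \mathrm{var}(N_\tau) = \lambda\tau.$$


• The exponential with parameter $\lambda$. This is the time T until the
first arrival. Its PDF, mean, and variance are

$$f_T(t) = \lambda e^{-\lambda t}, \quad t \geq 0, \qquad E[T] = \frac{1}{\lambda}, \qquad \mathrm{var}(T) = \frac{1}{\lambda^2}.$$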





Figure 6.6: View of the Bernoulli process as the discrete-time version of the
Poisson process. We discretize time in small intervals $\delta$ and associate each interval
with a Bernoulli trial whose parameter is $p = \lambda\delta$. The table summarizes some of
the basic correspondences.


Example 6.8. You get email according to a Poisson process at a rate of $\lambda = 0.2$
messages per hour. You check your email every hour. What is the probability of
finding 0 and 1 new messages?

These probabilities can be found using the Poisson PMF $e^{-\lambda\tau} (\lambda\tau)^k / k!$, with
$\tau = 1$, and k = 0 or k = 1:

$$P(0, 1) = e^{-0.2} = 0.819,$$

$$P(1, 1) = 0.2 \cdot e^{-0.2} = 0.164.$$

Suppose that you have not checked your email for a whole day. What is the
probability of finding no new messages? We use again the Poisson PMF and obtain

$$P(0, 24) = e^{-0.2 \cdot 24} = 0.0082.$$

Alternatively, we can argue that the event of no messages in a 24-hour period is the
intersection of the events of no messages during each of 24 hours. The latter events
are independent and the probability of each is $P(0, 1) = e^{-0.2}$, so

$$P(0, 24) = \big(P(0, 1)\big)^{24} = \big(e^{-0.2}\big)^{24} = 0.0082,$$

which is consistent with the preceding calculation method.
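
The numbers in Example 6.8 follow from a one-line evaluation of the Poisson PMF. The sketch below uses a hypothetical helper `poisson_count_prob` for $P(k, \tau)$ and also confirms that the two routes to P(0, 24) agree.

```python
from math import exp, factorial

def poisson_count_prob(k, rate, tau):
    """P(k, tau) = e^{-rate * tau} (rate * tau)^k / k!."""
    mean = rate * tau
    return exp(-mean) * mean ** k / factorial(k)

rate = 0.2                                        # messages per hour (from the example)
p0_hour = poisson_count_prob(0, rate, 1)          # no messages in one hour
p1_hour = poisson_count_prob(1, rate, 1)          # exactly one message in one hour
p0_day = poisson_count_prob(0, rate, 24)          # no messages in 24 hours
p0_day_product = p0_hour ** 24                    # 24 independent empty hours

print(round(p0_hour, 3), round(p1_hour, 3), p0_day, p0_day_product)
```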


Example 6.9. The Sum of Independent Poisson Random Variables is
Poisson. Arrivals of customers at the local supermarket are modeled by a Poisson
process with a rate of $\lambda = 10$ customers per minute. Let M be the number of
customers arriving between 9:00 and 9:10. Also, let N be the number of customers
arriving between 9:30 and 9:35. What is the distribution of M + N?

We notice that M is Poisson with parameter $\mu = 10 \cdot 10 = 100$ and N is Poisson
with parameter $\nu = 10 \cdot 5 = 50$. Furthermore, M and N are independent. As shown
in Section 4.4, using transforms, M + N is Poisson with parameter $\mu + \nu = 150$ (see
also Problem 11 in Chapter 4). We will now proceed to derive the same result in a
more direct and intuitive manner.

Let $\tilde{N}$ be the number of customers that arrive between 9:10 and 9:15. Note
that $\tilde{N}$ has the same distribution as N (Poisson with parameter 50). Furthermore,
$\tilde{N}$ is also independent of M. Thus, the distribution of M + N is the same as the
distribution of $M + \tilde{N}$. But $M + \tilde{N}$ is the number of arrivals during an interval of
length 15, and has therefore a Poisson distribution with parameter $10 \cdot 15 = 150$.

This example makes a point that is valid in general. The probability of k
arrivals during a set of times of total length $\tau$ is always given by $P(k, \tau)$, even if
that set is not an interval. (In this example, we dealt with the set [9:00, 9:10] ∪
[9:30, 9:35], of total length 15.)
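
The conclusion of Example 6.9 can also be checked numerically by convolving the two Poisson PMFs. This is a sketch for illustration, not the transform argument of Section 4.4; the PMF is evaluated in log space (an implementation choice made here) because terms such as 100^k overflow ordinary floating point for large k.

```python
from math import exp, isclose, lgamma, log

def poisson_pmf(k, lam):
    """Poisson PMF evaluated in log space to avoid overflow for large lam and k."""
    return exp(k * log(lam) - lam - lgamma(k + 1))

def sum_pmf(k, mu, nu):
    """P(M + N = k) for independent M ~ Poisson(mu), N ~ Poisson(nu), by convolution."""
    return sum(poisson_pmf(j, mu) * poisson_pmf(k - j, nu) for j in range(k + 1))

mu, nu = 100, 50        # parameters of M and N from the example
for k in (120, 150, 180):
    print(k, sum_pmf(k, mu, nu), poisson_pmf(k, mu + nu))
```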


Independence and Memorylessness 


The Poisson process has several properties that parallel those of the Bernoulli
process, including the independence of nonoverlapping time sets, and the
memorylessness of the interarrival time distribution. Given that the Poisson process
can be viewed as a limiting case of a Bernoulli process, the fact that it inherits
the qualitative properties of the latter should be hardly surprising.


Independence Properties of the Poisson Process 


• For any given time t > 0, the history of the process after time t is also
a Poisson process, and is independent from the history of the process
until time t.


• Let t be a given time and let T be the time of the first arrival after
time t. Then, T - t has an exponential distribution with parameter $\lambda$,
and is independent of the history of the process until time t.





The first property in the above table is established by observing that the
portion of the process that starts at time t satisfies the properties required by
the definition of the Poisson process. The independence of the future from the
past is a direct consequence of the independence assumption in the definition
of the Poisson process. Finally, the fact that T - t has the same exponential
distribution can be verified by noting that

$$P(T - t > s) = P(0 \text{ arrivals during } [t, t+s]) = P(0, s) = e^{-\lambda s}.$$

This is the memorylessness property, which is analogous to the one for the
Bernoulli process. The following examples make use of this property.


Example 6.10. You and your partner go to a tennis court, and have to wait until
the players occupying the court finish playing. Assume (somewhat unrealistically)
that their playing time has an exponential PDF. Then, the PDF of your waiting
time (equivalently, their remaining playing time) also has the same exponential
PDF, regardless of when they started playing.


Example 6.11. When you enter the bank, you find that all three tellers are busy 
serving other customers, and there are no other customers in queue. Assume that 
the service times for you and for each of the customers being served are independent 
identically distributed exponential random variables. What is the probability that 
you will be the last to leave? 

The answer is 1/3. To see this, focus on the moment when you start service
with one of the tellers. Then, the remaining time of each of the other two customers 
being served, as well as your own remaining time, have the same PDF. Therefore, 
you and the other two customers have equal probability 1/3 of being the last to 
leave. 


Interarrival Times 


An important random variable associated with a Poisson process that starts at
time 0, is the time of the kth arrival, which we denote by $Y_k$. A related random
variable is the kth interarrival time, denoted by $T_k$. It is defined by

$$T_1 = Y_1, \qquad T_k = Y_k - Y_{k-1}, \quad k = 2, 3, \ldots,$$

and represents the amount of time between the (k - 1)st and the kth arrival.
Note that

$$Y_k = T_1 + T_2 + \cdots + T_k.$$

We have already seen that the time $T_1$ until the first arrival is an exponential
random variable with parameter $\lambda$. Starting from the time $T_1$ of the first
arrival, the future is a fresh-starting Poisson process.† Thus, the time until the
next arrival has the same exponential PDF. Furthermore, the past of the process
(up to time $T_1$) is independent of the future (after time $T_1$). Since $T_2$ is
determined exclusively by what happens in the future, we see that $T_2$ is independent
of $T_1$. Continuing similarly, we conclude that the random variables $T_1, T_2, T_3, \ldots$
are independent and all have the same exponential distribution.

This important observation leads to an alternative, but equivalent, way of
describing the Poisson process.


† This statement is a bit stronger than the fact, discussed earlier, that starting
from any given deterministic time t the process starts fresh, but is quite intuitive. It
can be formally justified using an argument analogous to the one in Example 6.3, by
conditioning on all possible values of the random variable $T_1$.


Alternative Description of the Poisson Process


1. Start with a sequence of independent exponential random variables
$T_1, T_2, \ldots$, with common parameter $\lambda$, and let these represent the
interarrival times.


2. Record an arrival at times $T_1$, $T_1 + T_2$, $T_1 + T_2 + T_3$, etc.
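
The two-step recipe above translates directly into a simulation. The following is a minimal sketch under illustrative assumptions (rate 2, horizon 3, 20000 runs, fixed seed; none of these values come from the text). It generates arrival times from exponential interarrival times using the standard-library `random.Random.expovariate` and checks empirically that the number of arrivals in $[0, \tau]$ has mean and variance close to $\lambda\tau$.

```python
import random

def simulate_arrival_times(rate, horizon, rng):
    """Arrival times in [0, horizon] built from i.i.d. exponential interarrival times."""
    times = []
    t = rng.expovariate(rate)          # T_1
    while t <= horizon:
        times.append(t)                # record an arrival at T_1 + ... + T_k
        t += rng.expovariate(rate)     # add the next interarrival time
    return times

rng = random.Random(0)                 # fixed seed so the run is repeatable
rate, horizon, runs = 2.0, 3.0, 20000  # illustrative values (assumptions)
counts = [len(simulate_arrival_times(rate, horizon, rng)) for _ in range(runs)]
mean_count = sum(counts) / runs
var_count = sum((c - mean_count) ** 2 for c in counts) / runs
print(mean_count, var_count)           # both should be close to rate * horizon = 6
```

Being a Monte Carlo estimate, the output only approximates $\lambda\tau$; the agreement improves as the number of runs grows.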








The kth Arrival Time 


The time $Y_k$ of the kth arrival is equal to the sum $Y_k = T_1 + T_2 + \cdots + T_k$ of
k independent identically distributed exponential random variables. This allows
us to derive formulas for the mean, variance, and PDF of $Y_k$, which are given in
the table that follows.


Properties of the kth Arrival Time 


• The kth arrival time is equal to the sum of the first k interarrival times

$$Y_k = T_1 + T_2 + \cdots + T_k,$$

and the latter are independent exponential random variables with common
parameter $\lambda$.


• The mean and variance of $Y_k$ are given by

$$E[Y_k] = E[T_1] + \cdots + E[T_k] = \frac{k}{\lambda},$$

$$\mathrm{var}(Y_k) = \mathrm{var}(T_1) + \cdots + \mathrm{var}(T_k) = \frac{k}{\lambda^2}.$$

• The PDF of $Y_k$ is given by

$$f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}, \qquad y \geq 0,$$

and is known as the Erlang PDF of order k.





To evaluate the PDF $f_{Y_k}$ of $Y_k$, we argue that for a small $\delta$, the product
$\delta \cdot f_{Y_k}(y)$ approximates the probability that the $k$th arrival occurs between times
$y$ and $y + \delta$.† When $\delta$ is very small, the probability of more than one arrival
during the interval $[y, y + \delta]$ is negligible. Thus, the $k$th arrival occurs between
$y$ and $y + \delta$ if and only if the following two events $A$ and $B$ occur:


(a) event $A$: there is an arrival during the interval $[y, y + \delta]$;
(b) event $B$: there are exactly $k - 1$ arrivals before time $y$.


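The probabilities of these two events are

$$P(A) \approx \lambda\delta, \qquad \text{and} \qquad P(B) = P(k-1, y) = \frac{\lambda^{k-1} y^{k-1} e^{-\lambda y}}{(k-1)!}.$$

Since $A$ and $B$ are independent, we have

$$\delta f_{Y_k}(y) \approx P(y \le Y_k \le y + \delta) \approx P(A \cap B) = P(A)\,P(B) \approx \lambda\delta\, \frac{\lambda^{k-1} y^{k-1} e^{-\lambda y}}{(k-1)!},$$

from which we obtain

$$f_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}, \qquad y \ge 0.$$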


Example 6.12. You call the IRS hotline and you are told that you are the 56th 
person in line, excluding the person currently being served. Callers depart according 
to a Poisson process with a rate of $\lambda = 2$ per minute. How long will you have to
wait on the average until your service starts, and what is the probability you will 
have to wait for more than 30 minutes? 

By the memorylessness property, the remaining service time of the person 
currently being served is exponentially distributed with parameter 2. The service 
times of the 55 persons ahead of you are also exponential with the same parameter, 


† For an alternative derivation that does not rely on approximation arguments,
note that for a given $y \ge 0$, the event $\{Y_k \le y\}$ is the same as the event

{number of arrivals in the interval $[0, y]$ is at least $k$}.

Thus, the CDF of $Y_k$ is given by

$$F_{Y_k}(y) = P(Y_k \le y) = \sum_{n=k}^{\infty} P(n, y) = 1 - \sum_{n=0}^{k-1} P(n, y) = 1 - \sum_{n=0}^{k-1} \frac{(\lambda y)^n e^{-\lambda y}}{n!}.$$

The PDF of $Y_k$ can be obtained by differentiating the above expression, which by
straightforward calculation yields the Erlang PDF formula

$$f_{Y_k}(y) = \frac{d}{dy} F_{Y_k}(y) = \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!}.$$

and all of these random variables are independent. Thus, your waiting time in
minutes, call it $Y$, is Erlang of order 56, and

$$E[Y] = \frac{56}{\lambda} = 28.$$

The probability that you have to wait for more than 30 minutes is given by the
formula

$$P(Y > 30) = \int_{30}^{\infty} \frac{\lambda^{56} y^{55} e^{-\lambda y}}{55!}\, dy.$$
Computing this probability is quite tedious. On the other hand, since Y is the sum 
of a large number of independent identically distributed random variables, we can 
use an approximation based on the central limit theorem and the normal tables. 
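
Both routes can be carried out numerically. The sketch below is illustrative only; the helper names `poisson_cdf` and `standard_normal_cdf` are assumptions. It evaluates the probability without integrating, by using the fact that the event $\{Y > 30\}$ occurs exactly when at most 55 callers depart during the first 30 minutes, and compares the result with the central limit theorem approximation, which uses the mean $56/\lambda$ and variance $56/\lambda^2$ of $Y$.

```python
import math


def poisson_cdf(n_max, mean):
    """P(N <= n_max) for a Poisson random variable N with the given mean."""
    term = math.exp(-mean)       # P(N = 0)
    total = term
    for n in range(1, n_max + 1):
        term *= mean / n         # P(N = n) obtained from P(N = n - 1)
        total += term
    return total


def standard_normal_cdf(z):
    """CDF of the standard normal, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


lam, k, t = 2.0, 56, 30.0
# {Y > t} is the event that at most k - 1 callers depart during [0, t].
exact = poisson_cdf(k - 1, lam * t)
# CLT approximation: Y has mean k/lam and variance k/lam**2.
mean, std = k / lam, math.sqrt(k) / lam
approx = 1.0 - standard_normal_cdf((t - mean) / std)
print("via Poisson sum:  ", exact)
print("CLT approximation:", approx)
```

The two numbers should be close but not identical, since the normal approximation ignores the skewness of the Erlang PDF.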


Splitting and Merging of Poisson Processes 


Similar to the case of a Bernoulli process, we can start with a Poisson process
with rate $\lambda$ and split it, as follows: each arrival is kept with probability $p$ and
discarded with probability $1 - p$, independent of what happens to other arrivals.
In the Bernoulli case, we saw that the result of the splitting was also a Bernoulli
process. In the present context, the result of the splitting turns out to be a
Poisson process with rate $\lambda p$.

Alternatively, we can start with two independent Poisson processes, with
rates $\lambda_1$ and $\lambda_2$, and merge them by recording an arrival whenever an arrival
occurs in either process. It turns out that the merged process is also Poisson
with rate $\lambda_1 + \lambda_2$. Furthermore, any particular arrival of the merged process has
probability $\lambda_1/(\lambda_1 + \lambda_2)$ of originating from the first process, and probability
$\lambda_2/(\lambda_1 + \lambda_2)$ of originating from the second, independent of all other arrivals
and their origins.

We discuss these properties in the context of some examples, and at the 
same time provide the arguments that establish their validity. 


Example 6.13. Splitting of Poisson Processes. A packet that arrives at a 
node of a data network is either a local packet that is destined for that node (this 
happens with probability p), or else it is a transit packet that must be relayed to 
another node (this happens with probability $1 - p$). Packets arrive according to a
Poisson process with rate $\lambda$, and each one is a local or transit packet independent of
other packets and of the arrival times. As stated above, the process of local packet
arrivals is Poisson with rate $\lambda p$. Let us see why.

We verify that the process of local packet arrivals satisfies the defining
properties of a Poisson process. Since $\lambda$ and $p$ are constant (do not change with time),
the first property (time-homogeneity) clearly holds. Furthermore, there is no
dependence between what happens in disjoint time intervals, verifying the second
property. Finally, if we focus on a small interval of length $\delta$, the probability of a
local arrival is approximately the probability that there is a packet arrival, and that
this turns out to be a local one, i.e., $\lambda\delta \cdot p$. In addition, the probability of two or more
local arrivals is negligible in comparison to $\delta$, and this verifies the third property.
We conclude that local packet arrivals form a Poisson process and, in particular,
the number of such arrivals during an interval of length $\tau$ has a Poisson PMF with
parameter $p\lambda\tau$. By a symmetrical argument, the process of transit packet arrivals
is also Poisson, with rate $\lambda(1 - p)$. A somewhat surprising fact in this context is
that the two Poisson processes obtained by splitting an original Poisson process are
independent; see the end-of-chapter problems.
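
The conclusions of this example can be checked by simulation. The following sketch rests on illustrative assumptions: the function names, the seed, and the values of `lam`, `p`, and `tau` are not part of the example. It splits a simulated packet stream and reports the sample mean and variance of the local packet count, both of which should be near $\lambda p \tau$ for a Poisson PMF, together with the sample covariance of the local and transit counts, which should be near zero if the two split processes are independent.

```python
import random


def split_counts(lam, p, tau, rng):
    """Simulate an interval of length tau; return (local, transit) packet counts."""
    local = transit = 0
    t = rng.expovariate(lam)
    while t <= tau:
        if rng.random() < p:       # packet is local with probability p
            local += 1
        else:                      # otherwise it is a transit packet
            transit += 1
        t += rng.expovariate(lam)
    return local, transit


def sample_mean_var(xs):
    """Sample mean and unbiased sample variance of a list of numbers."""
    m = sum(xs) / len(xs)
    return m, sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


rng = random.Random(0)
lam, p, tau, runs = 3.0, 0.4, 5.0, 20_000   # illustrative values
pairs = [split_counts(lam, p, tau, rng) for _ in range(runs)]
local_mean, local_var = sample_mean_var([a for a, _ in pairs])
transit_mean, _ = sample_mean_var([b for _, b in pairs])
cov = sum((a - local_mean) * (b - transit_mean) for a, b in pairs) / (runs - 1)
print("local mean, variance:", local_mean, local_var, "target:", lam * p * tau)
print("covariance of local and transit counts:", cov)
```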


Example 6.14. Merging of Poisson Processes. People with letters to mail 
arrive at the post office according to a Poisson process with rate $\lambda_1$, while people
with packages to mail arrive according to an independent Poisson process with rate
$\lambda_2$. As stated earlier, the merged process, which includes arrivals of both types, is
Poisson with rate $\lambda_1 + \lambda_2$. Let us see why.

First, it should be clear that the merged process satisfies the time-homogeneity 
property. Furthermore, since different intervals in each of the two arrival processes
are independent, the same property holds for the merged process. Let us now focus
on a small interval of length $\delta$. Ignoring terms that are negligible compared to $\delta$,
we have

$$P(\text{0 arrivals in the merged process}) \approx (1 - \lambda_1\delta)(1 - \lambda_2\delta) \approx 1 - (\lambda_1 + \lambda_2)\delta,$$

$$P(\text{1 arrival in the merged process}) \approx \lambda_1\delta(1 - \lambda_2\delta) + (1 - \lambda_1\delta)\lambda_2\delta \approx (\lambda_1 + \lambda_2)\delta,$$

and the third property has been verified.

Given that an arrival has just been recorded, what is the probability that it 
is an arrival of a person with a letter to mail? We focus again on a small interval
of length $\delta$ around the current time, and we seek the probability

$$P(\text{1 arrival of person with a letter} \mid \text{1 arrival}).$$

Using the definition of conditional probabilities, and ignoring the negligible
probability of more than one arrival, this is

$$\frac{P(\text{1 arrival of person with a letter})}{P(\text{1 arrival})} \approx \frac{\lambda_1\delta}{(\lambda_1 + \lambda_2)\delta} = \frac{\lambda_1}{\lambda_1 + \lambda_2}.$$

Generalizing this calculation, we let $L_k$ be the event that the $k$th arrival corresponds
to an arrival of a person with a letter to mail, and we have

$$P(L_k) = \frac{\lambda_1}{\lambda_1 + \lambda_2}.$$

Furthermore, since distinct arrivals happen at different times, and since, for Poisson
processes, events at different times are independent, it follows that the random
variables $L_1, L_2, \ldots$ are independent.
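
A simulation of the post office illustrates both claims of this example. The sketch below makes illustrative assumptions (the helper `arrivals`, the seed, and the numerical rates): it generates the two arrival streams independently, merges them by sorting, and compares the mean interarrival time of the merged stream with $1/(\lambda_1 + \lambda_2)$ and the fraction of letter arrivals with $\lambda_1/(\lambda_1 + \lambda_2)$.

```python
import random


def arrivals(lam, horizon, rng):
    """Arrival times in [0, horizon] of a Poisson process with rate lam."""
    out, t = [], rng.expovariate(lam)
    while t <= horizon:
        out.append(t)
        t += rng.expovariate(lam)
    return out


rng = random.Random(0)
lam1, lam2, horizon = 1.0, 3.0, 20_000.0   # illustrative values
letters = [(t, "letter") for t in arrivals(lam1, horizon, rng)]
packages = [(t, "package") for t in arrivals(lam2, horizon, rng)]
merged = sorted(letters + packages)        # merge: record every arrival, in time order

gaps = [b[0] - a[0] for a, b in zip(merged, merged[1:])]
mean_gap = sum(gaps) / len(gaps)
letter_fraction = len(letters) / len(merged)
print("mean merged interarrival time:", mean_gap, "vs", 1 / (lam1 + lam2))
print("fraction of letter arrivals:  ", letter_fraction, "vs", lam1 / (lam1 + lam2))
```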


Example 6.15. Competing Exponentials. Two light bulbs have independent 
and exponentially distributed lifetimes $T_a$ and $T_b$, with parameters $\lambda_a$ and $\lambda_b$,
respectively. What is the distribution of $Z = \min\{T_a, T_b\}$, the first time when a
bulb burns out?

For all $z \ge 0$, we have

$$\begin{aligned}
F_Z(z) &= P(\min\{T_a, T_b\} \le z) \\
&= 1 - P(\min\{T_a, T_b\} > z) \\
&= 1 - P(T_a > z,\ T_b > z) \\
&= 1 - P(T_a > z)\,P(T_b > z) \\
&= 1 - e^{-\lambda_a z} e^{-\lambda_b z} \\
&= 1 - e^{-(\lambda_a + \lambda_b) z}.
\end{aligned}$$

This is recognized as the exponential CDF with parameter $\lambda_a + \lambda_b$. Thus, the
minimum of two independent exponentials with parameters $\lambda_a$ and $\lambda_b$ is an exponential
with parameter $\lambda_a + \lambda_b$.

For a more intuitive explanation of this fact, let us think of $T_a$ and $T_b$ as
the times of the first arrivals in two independent Poisson processes with rates $\lambda_a$
and $\lambda_b$, respectively. If we merge these two processes, the first arrival time will be
$\min\{T_a, T_b\}$. But we already know that the merged process is Poisson with rate
$\lambda_a + \lambda_b$, and it follows that the first arrival time, $\min\{T_a, T_b\}$, is exponential with
parameter $\lambda_a + \lambda_b$.


The preceding discussion can be generalized to the case of more than two
processes. Thus, the total arrival process obtained by merging the arrivals of
$n$ independent Poisson processes with arrival rates $\lambda_1, \ldots, \lambda_n$ is Poisson with
arrival rate equal to the sum $\lambda_1 + \cdots + \lambda_n$.


Example 6.16. More on Competing Exponentials. Three light bulbs have 
independent exponentially distributed lifetimes with a common parameter $\lambda$. What
is the expected value of the time until the last bulb burns out?

We think of the times when each bulb burns out as the first arrival times
in independent Poisson processes. In the beginning, we have three bulbs, and the
merged process has rate $3\lambda$. Thus, the time $T_1$ of the first burnout is exponential
with parameter $3\lambda$, and mean $1/(3\lambda)$. Once a bulb burns out, and because of the
memorylessness property of the exponential distribution, the remaining lifetimes
of the other two light bulbs are again independent exponential random variables
with parameter $\lambda$. We thus have two Poisson processes running in parallel, and
the remaining time $T_2$ until the first arrival in one of these two processes is now
exponential with parameter $2\lambda$ and mean $1/(2\lambda)$. Finally, once a second bulb burns
out, we are left with a single one. Using memorylessness once more, the remaining
time $T_3$ until the last bulb burns out is exponential with parameter $\lambda$ and mean
$1/\lambda$. Thus, the expected value of the total time is

$$E[T_1 + T_2 + T_3] = \frac{1}{3\lambda} + \frac{1}{2\lambda} + \frac{1}{\lambda}.$$

Note that the random variables $T_1$, $T_2$, $T_3$ are independent, because of
memorylessness. This allows us to also compute the variance of the total time:

$$\mathrm{var}(T_1 + T_2 + T_3) = \mathrm{var}(T_1) + \mathrm{var}(T_2) + \mathrm{var}(T_3) = \frac{1}{9\lambda^2} + \frac{1}{4\lambda^2} + \frac{1}{\lambda^2}.$$
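
The time until the last bulb burns out is simply the maximum of the three lifetimes, so the two formulas of this example can be compared against a direct simulation. The sketch below is illustrative; the value of `lam`, the seed, and the number of runs are assumptions.

```python
import random

lam = 2.0   # illustrative value of the common parameter
# Values from the decomposition T1 + T2 + T3, with parameters 3*lam, 2*lam, lam.
mean_exact = sum(1 / (m * lam) for m in (3, 2, 1))
var_exact = sum(1 / (m * lam) ** 2 for m in (3, 2, 1))

rng = random.Random(0)
runs = 100_000
# Direct simulation: the last burnout time is the maximum of the three lifetimes.
samples = [max(rng.expovariate(lam) for _ in range(3)) for _ in range(runs)]
mean_sim = sum(samples) / runs
var_sim = sum((s - mean_sim) ** 2 for s in samples) / (runs - 1)
print("mean:    ", mean_exact, "(formula)", mean_sim, "(simulation)")
print("variance:", var_exact, "(formula)", var_sim, "(simulation)")
```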
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Bernoulli and Poisson Processes, and Sums of Random Variables 


The insights obtained from splitting and merging of Bernoulli and Poisson pro- 
cesses can be used to provide simple explanations of some interesting properties 
involving sums of a random number of independent random variables. Alter- 
native proofs, based for example on manipulating PMFs/PDFs, solving derived 
distribution problems, or using transforms, tend to be unintuitive. We collect
these properties in the following table. 


Properties of Sums of a Random Number of Random Variables 


Let $N, X_1, X_2, \ldots$ be independent random variables, where $N$ takes
nonnegative integer values. Let $Y = X_1 + \cdots + X_N$ for positive values of $N$, and
let $Y = 0$ when $N = 0$.


• If $X_i$ is Bernoulli with parameter $p$, and $N$ is binomial with parameters
$m$ and $q$, then $Y$ is binomial with parameters $m$ and $pq$.


• If $X_i$ is Bernoulli with parameter $p$, and $N$ is Poisson with parameter
$\lambda$, then $Y$ is Poisson with parameter $\lambda p$.


• If $X_i$ is geometric with parameter $p$, and $N$ is geometric with parameter
$q$, then $Y$ is geometric with parameter $pq$.


• If $X_i$ is exponential with parameter $\lambda$, and $N$ is geometric with parameter
$q$, then $Y$ is exponential with parameter $\lambda q$.





The first two properties are shown in Problem 22, the third property is
shown in Problem 6, and the last property is shown in Problem 23. The last
three properties were also shown in Chapter 4, by using transforms (see Section
4.4 and the last end-of-chapter problem of Chapter 4). One more related property
is shown in Problem 24, namely that if $N_t$ denotes the number of arrivals of a
Poisson process with parameter $\lambda$ within an interval of length $t$, and $T$ is an
interval with length that is exponentially distributed with parameter $\nu$ and is
independent of the Poisson process, then $N_T + 1$ is geometrically distributed
with parameter $\nu/(\lambda + \nu)$.
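
As an illustration of the last property in the table, the sketch below simulates the sum of a geometric number of exponential random variables and compares its sample mean and one tail probability with those of an exponential with parameter $\lambda q$. The helper `geometric`, the seed, the test point `y0`, and the parameter values are assumptions made for this sketch.

```python
import math
import random


def geometric(q, rng):
    """Number of independent trials up to and including the first success."""
    n = 1
    while rng.random() >= q:   # failure, with probability 1 - q
        n += 1
    return n


rng = random.Random(0)
lam, q, runs = 2.0, 0.25, 100_000   # illustrative values
ys = [sum(rng.expovariate(lam) for _ in range(geometric(q, rng)))
      for _ in range(runs)]

mean_sim = sum(ys) / runs
y0 = 1.5
tail_sim = sum(y > y0 for y in ys) / runs
print("mean:", mean_sim, "vs 1/(lam*q):", 1 / (lam * q))
print("P(Y > y0):", tail_sim, "vs exp(-lam*q*y0):", math.exp(-lam * q * y0))
```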

Let us also note a related and quite deep fact, namely that the sum of
a large number of (not necessarily Poisson) independent arrival processes can
be approximated by a Poisson process with arrival rate equal to the sum of 
the individual arrival rates. The component processes must have a small rate 
relative to the total (so that none of them imposes its probabilistic character on 
the total arrival process) and they must also satisfy some technical mathematical 
assumptions. Further discussion of this fact is beyond our scope, but we note 
that it is in large measure responsible for the abundance of Poisson-like processes 
in practice. For example, the telephone traffic originating in a city consists of
many component processes, each of which characterizes the phone calls placed by
individual residents. The component processes need not be Poisson: some people 
for example tend to make calls in batches, and (usually) while in the process of 
talking, cannot initiate or receive a second call. However, the total telephone 
traffic is well-modeled by a Poisson process. For the same reasons, the process 
of auto accidents in a city, customer arrivals at a store, particle emissions from 
radioactive material, etc., tend to have the character of the Poisson process. 


The Random Incidence Paradox 


The arrivals of a Poisson process partition the time axis into a sequence of 
interarrival intervals; each interarrival interval starts with an arrival and ends at 
the time of the next arrival. We have seen that the lengths of these interarrival 
intervals are independent exponential random variables with parameter $\lambda$, where
$\lambda$ is the rate of the process. More precisely, for every $k$, the length of the $k$th
interarrival interval has this exponential distribution. In this subsection, we look 
at these interarrival intervals from a different perspective. 

Let us fix a time instant $t^*$ and consider the length $L$ of the interarrival
interval that contains $t^*$. For a concrete context, think of a person who shows
up at the bus station at some arbitrary time $t^*$ and records the time from the
previous bus arrival until the next bus arrival. The arrival of this person is often
referred to as a “random incidence,” but the reader should be aware that the
term is misleading: $t^*$ is just a particular time instant, not a random variable.

We assume that $t^*$ is much larger than the starting time of the Poisson
process so that we can be fairly certain that there has been an arrival prior to
time $t^*$. To avoid the issue of how large $t^*$ should be, we assume that the Poisson
process has been running forever, so that we can be certain that there has been 
a prior arrival, and that L is well-defined. One might superficially argue that L 
is the length of a "typical" interarrival interval, and is exponentially distributed, 
but this turns out to be false. Instead, we will establish that L has an Erlang 
PDF of order two. 

This is known as the random incidence phenomenon or paradox, and it can
be explained with the help of Fig. 6.7. Let $[U, V]$ be the interarrival interval that
contains $t^*$, so that $L = V - U$. In particular, $U$ is the time of the first arrival
prior to $t^*$ and $V$ is the time of the first arrival after $t^*$. We split $L$ into two
parts,

$$L = (t^* - U) + (V - t^*),$$

where $t^* - U$ is the elapsed time since the last arrival, and $V - t^*$ is the remaining
time until the next arrival. Note that $t^* - U$ is determined by the past history of
the process (before $t^*$), while $V - t^*$ is determined by the future of the process
(after time $t^*$). By the independence properties of the Poisson process, the
random variables $t^* - U$ and $V - t^*$ are independent. By the memorylessness
property, the Poisson process starts fresh at time $t^*$, and therefore $V - t^*$ is
exponential with parameter $\lambda$. The random variable $t^* - U$ is also exponential
with parameter $\lambda$. The easiest way to see this is to realize that if we run a
Poisson process backwards in time it remains Poisson; this is because the defining
properties of a Poisson process make no reference to whether time moves forward
or backward. A more formal argument is obtained by noting that

$$P(t^* - U > x) = P(\text{no arrivals during } [t^* - x, t^*]) = P(0, x) = e^{-\lambda x}, \qquad x \ge 0.$$

We have therefore established that $L$ is the sum of two independent exponential
random variables with parameter $\lambda$, i.e., Erlang of order two, with mean $2/\lambda$.

Figure 6.7: Illustration of the random incidence phenomenon. For a fixed time
instant $t^*$, the corresponding interarrival interval $[U, V]$ consists of the elapsed
time $t^* - U$ and the remaining time $V - t^*$. These two times are independent
and are exponentially distributed with parameter $\lambda$, so the PDF of their sum is
Erlang of order two.


Random incidence phenomena are often the source of misconceptions and 
errors, but these can be avoided with careful probabilistic modeling. The key 
issue is that an observer who arrives at an arbitrary time is more likely to fall in 
a large rather than a small interarrival interval. As a consequence the expected
length seen by the observer is higher, $2/\lambda$ compared with the $1/\lambda$ mean of the
exponential PDF. A similar situation arises in the example that follows.
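
The bias is easy to observe in a simulation. The sketch below rests on a simplifying assumption that should be kept in mind: the process is started at time 0 rather than having run forever, and $t^*$ is taken large enough that the absence of a prior arrival is extremely unlikely. The function name, the seed, and the parameter values are likewise illustrative.

```python
import random


def straddling_interval_length(lam, t_star, rng):
    """Length V - U of the interarrival interval that contains t_star."""
    prev, t = 0.0, rng.expovariate(lam)
    while t <= t_star:
        prev, t = t, t + rng.expovariate(lam)
    # prev plays the role of U (last arrival before t_star), t the role of V.
    return t - prev


rng = random.Random(0)
lam, t_star, runs = 1.0, 20.0, 30_000   # illustrative values
lengths = [straddling_interval_length(lam, t_star, rng) for _ in range(runs)]
mean_length = sum(lengths) / runs
print("average length of the interval containing t*:", mean_length)
print("2/lam:", 2 / lam, "  1/lam:", 1 / lam)
```

The sample average should be close to $2/\lambda$, not to the mean $1/\lambda$ of a single interarrival time.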


Example 6.17. Random Incidence in a Non-Poisson Arrival Process. 
Buses arrive at a station deterministically, on the hour, and five minutes after the 
hour. Thus, the interarrival times alternate between 5 and 55 minutes. The average 
interarrival time is 30 minutes. A person shows up at the bus station at a “random” 
time. We interpret “random” to mean a time that is uniformly distributed within 
a particular hour. Such a person falls into an interarrival interval of length 5 with 
probability 1/12, and an interarrival interval of length 55 with probability 11/12. 
The expected length of the chosen interarrival interval is 


$$5 \cdot \frac{1}{12} + 55 \cdot \frac{11}{12} = 50.83,$$


which is considerably larger than 30, the average interarrival time. 
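
The arithmetic of this example can be written so that the source of the bias is explicit: an instant chosen uniformly within the hour lands in a given interarrival interval with probability proportional to the length of that interval. The sketch below uses exact fractions; the variable names are illustrative assumptions.

```python
from fractions import Fraction

lengths = [5, 55]                  # the two interarrival times within each hour
hour = sum(lengths)
# A uniformly chosen instant falls in an interval with probability
# proportional to the length of that interval.
probs = [Fraction(length, hour) for length in lengths]
seen_by_observer = sum(p * length for p, length in zip(probs, lengths))
plain_average = Fraction(sum(lengths), len(lengths))
print("probabilities:", probs)
print("expected length seen by the observer:", float(seen_by_observer))
print("average interarrival time:", float(plain_average))
```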


As the preceding example indicates, random incidence is a subtle
phenomenon that introduces a bias in favor of larger interarrival intervals, and can
manifest itself in contexts other than the Poisson process. In general, whenever 
different calculations give contradictory results, the reason is that they refer to 
different probabilistic mechanisms. For instance, considering a fixed nonrandom
$k$ and the associated random value of the $k$th interarrival interval is a
different experiment from fixing a time $t$ and considering the random $K$ such
that the $K$th interarrival interval contains $t$.

For a last example with the same flavor, consider a survey of the utilization 
of town buses. One approach is to select a few buses at random and calculate 
the average number of riders in the selected buses. An alternative approach is to 
select a few bus riders at random, look at the buses that they rode and calculate 
the average number of riders in the latter set of buses. The estimates produced 
by the two methods have very different statistics, with the second method being 
biased upwards. The reason is that with the second method, it is much more 
likely to select a bus with a large number of riders than a bus that is near-empty. 


SUMMARY AND DISCUSSION 


In this chapter, we introduced and analyzed two memoryless arrival processes.
The Bernoulli process evolves in discrete time, and during each discrete time
step, there is a constant probability $p$ of an arrival. The Poisson process evolves
in continuous time, and during each small interval of length $\delta > 0$, there is a
probability of an arrival approximately equal to $\lambda\delta$. In both cases, the numbers of
arrivals in disjoint time intervals are assumed independent. The Poisson process
can be viewed as a limiting case of the Bernoulli process, in which the duration
of each discrete time slot is taken to be a very small number $\delta$. This fact can be
used to draw parallels between the major properties of the two processes, and 
to transfer insights gained from one process to the other. 

Using the memorylessness property of the Bernoulli and Poisson processes, 
we derived the following. 


(a) The PMF of the number of arrivals during a time interval of given length 
is binomial and Poisson, respectively.


(b) The distribution of the time between successive arrivals is geometric and 
exponential, respectively. 


(c) The distribution of the time until the kth arrival is Pascal of order k and
Erlang of order k, respectively. 


Furthermore, we saw that one can start with two independent Bernoulli
(respectively, Poisson) processes and “merge” them to form a new Bernoulli
(respectively, Poisson) process. Conversely, if one “accepts” each arrival by tossing
a coin with success probability $p$ (“splitting”), the process of accepted arrivals
is a Bernoulli or Poisson process whose arrival rate is $p$ times the original rate.

We finally considered the "random incidence" phenomenon where an
external observer arrives at some given time and measures the interarrival interval
within which he arrives. The probabilistic properties of the measured interval 
are not "typical" because the arriving observer is more likely to fall in a larger 
interarrival interval. This phenomenon indicates that when talking about a “typ- 
ical" interval, one must carefully describe the mechanism by which it is selected. 
Different mechanisms will in general result in different probabilistic properties.

PROBLEMS


SECTION 6.1. The Bernoulli Process


Problem 1. Each of n packages is loaded independently onto either a red truck (with 
probability p) or onto a green truck (with probability 1 — p). Let R be the total number 
of items selected for the red truck and let G be the total number of items selected for 
the green truck. 


(a) Determine the PMF, expected value, and variance of the random variable R. 


(b) Evaluate the probability that the first item to be loaded ends up being the only 
one on its truck. 


(c) Evaluate the probability that at least one truck ends up with a total of exactly 
one package. 


(d) Evaluate the expected value and the variance of the difference R — G. 


(e) Assume that $n \ge 2$. Given that both of the first two packages to be loaded go
onto the red truck, find the conditional expectation, variance, and PMF of the 
random variable R. 


Problem 2. Dave fails quizzes with probability 1/4, independent of other quizzes. 


(a) What is the probability that Dave fails exactly two of the next six quizzes? 


(b) What is the expected number of quizzes that Dave will pass before he has failed 
three times? 


(c) What is the probability that the second and third time Dave fails a quiz will 
occur when he takes his eighth and ninth quizzes, respectively? 


(d) What is the probability that Dave fails two quizzes in a row before he passes two 
quizzes in a row? 


Problem 3. A computer system carries out tasks submitted by two users. Time 
is divided into slots. A slot can be idle, with probability $p_I = 1/6$, and busy with
probability $p_B = 5/6$. During a busy slot, there is probability $p_{1|B} = 2/5$ (respectively,
$p_{2|B} = 3/5$) that a task from user 1 (respectively, 2) is executed. We assume that
events related to different slots are independent. 


(a) Find the probability that a task from user 1 is executed for the first time during 
the 4th slot. 


(b) Given that exactly 5 out of the first 10 slots were idle, find the probability that 
the 6th idle slot is slot 12. 


(c) Find the expected number of slots up to and including the 5th task from user 1.

(d) Find the expected number of busy slots up to and including the 5th task from
user 1. 


(e) Find the PMF, mean, and variance of the number of tasks from user 2 until the 
time of the 5th task from user 1. 


Problem 4.* Consider a Bernoulli process with probability of success in each trial 
equal to p. 


(a) Relate the number of failures before the rth success (sometimes called a negative 
binomial random variable) to a Pascal random variable and derive its PMF. 


(b) Find the expected value and variance of the number of failures before the rth 
success. 


(c) Obtain an expression for the probability that the ith failure occurs before the 
rth success. 


Solution. (a) Let $Y$ be the number of trials until the $r$th success, which is a Pascal
random variable of order $r$. Let $X$ be the number of failures before the $r$th success, so
that $X = Y - r$. Therefore, $p_X(k) = p_Y(k + r)$, and

$$p_X(k) = \binom{k+r-1}{r-1} p^r (1-p)^k, \qquad k = 0, 1, \ldots.$$

(b) Using the notation of part (a), we have

$$E[X] = E[Y] - r = \frac{r}{p} - r = \frac{(1-p)r}{p}.$$

Furthermore,

$$\mathrm{var}(X) = \mathrm{var}(Y) = \frac{(1-p)r}{p^2}.$$

(c) Let again $X$ be the number of failures before the $r$th success. The $i$th failure occurs
before the $r$th success if and only if $X \ge i$. Therefore, the desired probability is equal
to

$$\sum_{k=i}^{\infty} p_X(k) = \sum_{k=i}^{\infty} \binom{k+r-1}{r-1} p^r (1-p)^k.$$

An alternative formula is derived as follows. Consider the first $r + i - 1$ trials. The
number of failures in these trials is at least $i$ if and only if the number of successes is
less than $r$. But this is equivalent to the $i$th failure occurring before the $r$th success.
Hence, the desired probability is the probability that the number of successes in $r + i - 1$
trials is less than $r$, which is

$$\sum_{k=0}^{r-1} \binom{r+i-1}{k} p^k (1-p)^{r+i-1-k}.$$
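
The two formulas of part (c) can be compared numerically. The sketch below is illustrative: the function names `via_pmf` and `via_binomial` are assumptions, and the infinite sum is truncated after a fixed number of terms, which is adequate only when the remaining tail is negligible, as it is for the values of $p$ used here.

```python
import math


def via_pmf(i, r, p, terms=2000):
    """Truncated sum over k >= i of C(k+r-1, r-1) p^r (1-p)^k."""
    return sum(math.comb(k + r - 1, r - 1) * p**r * (1 - p)**k
               for k in range(i, i + terms))


def via_binomial(i, r, p):
    """P(fewer than r successes in the first r + i - 1 trials)."""
    n = r + i - 1
    return sum(math.comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r))


for i, r, p in [(1, 1, 0.3), (2, 3, 0.5), (4, 2, 0.7)]:   # illustrative cases
    print(i, r, p, via_pmf(i, r, p), via_binomial(i, r, p))
```

For $i = r = 1$ both expressions reduce to $1 - p$, the probability that the first trial is a failure.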


Problem 5.* Random incidence in the Bernoulli process. Your cousin has 
been playing the same video game from time immemorial. Assume that he wins each
game with probability $p$, independent of the outcomes of other games. At midnight,
you enter his room and witness his losing the current game. What is the PMF of the
number of lost games between his most recent win and his first future win?


Solution. Let $t$ be the number of the game when you enter the room. Let $M$ be the
number of the most recent past game that he won, and let $N$ be the number of the first
game to be won in the future. The random variable $X = N - t$ is geometrically
distributed with parameter $p$. By symmetry and independence of the games, the random
variable $Y = t - M$ is also geometrically distributed with parameter $p$. The games he
lost between his most recent win and his first future win are all the games between $M$
and $N$. Their number $L$ is given by

$$L = N - M - 1 = X + Y - 1.$$

Thus, $L + 1$ has a Pascal PMF of order two, and

$$P(L + 1 = k) = \binom{k-1}{1} p^2 (1-p)^{k-2} = (k-1) p^2 (1-p)^{k-2}, \qquad k = 2, 3, \ldots.$$

Hence,

$$p_L(i) = P(L + 1 = i + 1) = i\, p^2 (1-p)^{i-1}, \qquad i = 1, 2, \ldots.$$


Problem 6.* Sum of a geometric number of independent geometric random 
variables. Let $Y = X_1 + \cdots + X_N$, where the random variables $X_i$ are geometric with
parameter $p$, and $N$ is geometric with parameter $q$. Assume that the random variables
$N, X_1, X_2, \ldots$ are independent. Show, without using transforms, that $Y$ is geometric
with parameter $pq$. Hint: Interpret the various random variables in terms of a split
Bernoulli process.


Solution. We derived this result in Chapter 4, using transforms, but we develop a
more intuitive derivation here. We interpret the random variables $X_i$ and $N$ as follows.
We view the times $X_1$, $X_1 + X_2$, etc. as the arrival times in a Bernoulli process with
parameter $p$. Each arrival is rejected with probability $1 - q$ and is accepted with
probability $q$. We interpret $N$ as the number of arrivals until the first acceptance. The
process of accepted arrivals is obtained by splitting a Bernoulli process and is therefore
itself Bernoulli with parameter $pq$. The random variable $Y = X_1 + \cdots + X_N$ is the
time of the first accepted arrival and is therefore geometric, with parameter $pq$.
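
A simulation offers an empirical check of this result. The sketch below is illustrative (the helper names, the seed, and the values of $p$ and $q$ are assumptions): it draws a geometric number of geometric random variables, sums them, and compares the empirical frequencies with the geometric PMF $pq(1 - pq)^{k-1}$.

```python
import random


def geometric(p, rng):
    """Number of independent trials up to and including the first success."""
    n = 1
    while rng.random() >= p:   # failure, with probability 1 - p
        n += 1
    return n


def frequency(values, k):
    """Fraction of the simulated values that are equal to k."""
    return sum(v == k for v in values) / len(values)


rng = random.Random(0)
p, q, runs = 0.5, 0.4, 100_000   # illustrative values
ys = [sum(geometric(p, rng) for _ in range(geometric(q, rng)))
      for _ in range(runs)]

mean_sim = sum(ys) / runs
print("mean:", mean_sim, "vs 1/(p*q):", 1 / (p * q))
for k in (1, 2, 3):
    print(k, frequency(ys, k), p * q * (1 - p * q) ** (k - 1))
```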


Problem 7.* The bits in a uniform random variable form a Bernoulli
process. Let $X_1, X_2, \ldots$ be a sequence of binary random variables taking values in the set
$\{0, 1\}$. Let $Y$ be a continuous random variable that takes values in the set $[0, 1]$. We
relate $X$ and $Y$ by assuming that $Y$ is the real number whose binary representation is
$0.X_1 X_2 X_3 \ldots$. More concretely, we have

$$Y = \sum_{k=1}^{\infty} \frac{X_k}{2^k}.$$

(a) Suppose that the $X_i$ form a Bernoulli process with parameter $p = 1/2$. Show
that $Y$ is uniformly distributed. Hint: Consider the probability of the event
$(i - 1)/2^k < Y < i/2^k$, where $i$ and $k$ are positive integers.

(b) Suppose that $Y$ is uniformly distributed. Show that the $X_i$ form a Bernoulli
process with parameter $p = 1/2$.


Solution. (a) We have

$$P(Y \in [0, 1/2]) = P(X_1 = 0) = \frac{1}{2} = P(Y \in [1/2, 1]).$$

Furthermore,

$$P(Y \in [0, 1/4]) = P(X_1 = 0,\ X_2 = 0) = \frac{1}{4}.$$

Arguing similarly, we consider an interval of the form $[(i-1)/2^k,\ i/2^k]$, where $i$ and $k$
are positive integers and $i \le 2^k$. For $Y$ to fall in the interior of this interval, we need
$X_1, \ldots, X_k$ to take on a particular sequence of values (namely, the digits in the binary
expansion of $i - 1$). Hence,

$$P\big((i-1)/2^k < Y < i/2^k\big) = \frac{1}{2^k}.$$

Note also that for any $y \in [0, 1]$, we have $P(Y = y) = 0$, because the event $\{Y = y\}$ can
only occur if infinitely many $X_i$s take on particular values, a zero probability event.
Therefore, the CDF of $Y$ is continuous and satisfies

$$P(Y \le i/2^k) = i/2^k.$$

Since every number $y$ in $[0, 1]$ can be closely approximated by a number of the form
$i/2^k$, we have $P(Y \le y) = y$, for every $y \in [0, 1]$, which establishes that $Y$ is uniform.


(b) As in part (a), we observe that every possible zero-one pattern for $X_1, \ldots, X_k$
is associated to one particular interval of the form $[(i - 1)/2^k,\ i/2^k]$ for $Y$. These
intervals have equal length, and therefore have the same probability $1/2^k$, since $Y$
is uniform. This particular joint PMF for $X_1, \ldots, X_k$ corresponds to independent
Bernoulli random variables with parameter $p = 1/2$.
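
Part (a) can be illustrated by simulation. The sketch below makes two assumptions beyond the problem statement: the binary expansion is truncated after a finite number of bits (`n_bits`), and the function name, seed, and test points are chosen for illustration. It builds $Y$ from fair bits and compares the empirical value of $P(Y \le y)$ with $y$.

```python
import random


def y_from_bits(p, n_bits, rng):
    """Truncated binary expansion 0.X1 X2 ... Xn, where P(Xk = 1) = p."""
    return sum((rng.random() < p) / 2 ** k for k in range(1, n_bits + 1))


rng = random.Random(0)
runs, n_bits = 20_000, 32   # illustrative values
ys = [y_from_bits(0.5, n_bits, rng) for _ in range(runs)]

for y in (0.1, 0.3, 0.75):
    empirical = sum(v <= y for v in ys) / runs
    print("P(Y <=", y, ") is approximately", empirical)
```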


SECTION 6.2. The Poisson Process


Problem 8. During rush hour, from 8 a.m. to 9 a.m., traffic accidents occur according
to a Poisson process with a rate of 5 accidents per hour. Between 9 a.m. and 11 a.m., 
they occur as an independent Poisson process with a rate of 3 accidents per hour. What 
is the PMF of the total number of accidents between 8 a.m. and 11 a.m.? 


Problem 9. An athletic facility has 5 tennis courts. Pairs of players arrive at the 
courts and use a court for an exponentially distributed time with mean 40 minutes. 
Suppose a pair of players arrives and finds all courts busy and k other pairs waiting in 
queue. What is the expected waiting time to get a court? 


Problem 10. A fisherman catches fish according to a Poisson process with rate 
$\lambda = 0.6$ per hour. The fisherman will keep fishing for two hours. If he has caught at
least one fish, he quits. Otherwise, he continues until he catches at least one fish.

(a) Find the probability that he stays for more than two hours.


(b) Find the probability that the total time he spends fishing is between two and five 
hours. 


(c) Find the probability that he catches at least two fish. 
(d) Find the expected number of fish that he catches. 


(e) Find the expected total fishing time, given that he has been fishing for four hours. 


Problem 11. Customers depart from a bookstore according to a Poisson process 
with rate $\lambda$ per hour. Each customer buys a book with probability $p$, independent of
everything else. 


(a) Find the distribution of the time until the first sale of a book. 
(b) Find the probability that no books are sold during a particular hour. 


(c) Find the expected number of customers who buy a book during a particular hour. 


Problem 12. A pizza parlor serves n different types of pizza, and is visited by 
a number K of customers in a given period of time, where K is a Poisson random 
variable with mean $\lambda$. Each customer orders a single pizza, with all types of pizza
being equally likely, independent of the number of other customers and the types of 
pizza they order. Find the expected number of different types of pizzas ordered. 


Problem 13. Transmitters A and B independently send messages to a single receiver 
in a Poisson manner, with rates of $\lambda_A$ and $\lambda_B$, respectively. All messages are so brief
that we may assume that they occupy single points in time. The number of words in
a message, regardless of the source that is transmitting it, is a random variable with
PMF

$$p_W(w) = \begin{cases} 2/6, & \text{if } w = 1, \\ 3/6, & \text{if } w = 2, \\ 1/6, & \text{if } w = 3, \\ 0, & \text{otherwise,} \end{cases}$$


and is independent of everything else. 


(a) What is the probability that during an interval of duration t, a total of exactly 
nine messages will be received? 


(b) Let N be the total number of words received during an interval of duration t. 
Determine the expected value of N. 


(c) Determine the PDF of the time from $t = 0$ until the receiver has received exactly
eight three-word messages from transmitter A. 


(d) What is the probability that exactly eight of the next twelve messages received 
will be from transmitter A? 


Problem 14. Beginning at time $t = 0$, we start using bulbs, one at a time, to
illuminate a room. Bulbs are replaced immediately upon failure. Each new bulb is
selected independently by an equally likely choice between a type-A bulb and a type-B
bulb. The lifetime, $X$, of any particular bulb of a particular type is a random variable,
independent of everything else, with the following PDF:

$$\text{for type-A bulbs: } f_X(x) = \begin{cases} e^{-x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise;} \end{cases}$$

$$\text{for type-B bulbs: } f_X(x) = \begin{cases} 3e^{-3x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise.} \end{cases}$$

(a) Find the expected time until the first failure.

(b) Find the probability that there are no bulb failures before time $t$.

(c) Given that there are no failures until time $t$, determine the conditional probability
that the first bulb used is a type-A bulb.

(d) Find the variance of the time until the first bulb failure.

(e) Find the probability that the 12th bulb failure is also the 4th type-A bulb failure.

(f) Up to and including the 12th bulb failure, what is the probability that a total of
exactly 4 type-A bulbs have failed?

(g) Determine either the PDF or the transform associated with the time until the
12th bulb failure.

(h) Determine the probability that the total period of illumination provided by the
first two type-B bulbs is longer than that provided by the first type-A bulb.

(i) Suppose the process terminates as soon as a total of exactly 12 bulb failures
have occurred. Determine the expected value and variance of the total period of
illumination provided by type-B bulbs while the process is in operation.

(j) Given that there are no failures until time $t$, find the expected value of the time
until the first failure.


Problem 15. A service station handles jobs of two types, A and B. (Multiple jobs can 
be processed simultaneously.) Arrivals of the two job types are independent Poisson 
processes with parameters $\lambda_A = 3$ and $\lambda_B = 4$ per minute, respectively. Type A jobs
stay in the service station for exactly one minute. Each type B job stays in the service 
station for a random but integer amount of time which is geometrically distributed, 
with mean equal to 2, and independent of everything else. The service station started 
operating at some time in the remote past. 


(a) What is the mean, variance, and PMF of the total number of jobs that arrive
within a given three-minute interval?

(b) We are told that during a 10-minute interval, exactly 10 new jobs arrived. What
is the probability that exactly 3 of them are of type A?

(c) At time 0, no job is present in the service station. What is the PMF of the number
of type B jobs that arrive in the future, but before the first type A arrival?

(d) At time $t = 0$, there were exactly two type A jobs in the service station. What
is the PDF of the time of the last (before time 0) type A arrival?

(e) At time 1, there was exactly one type B job in the service station. Find the
distribution of the time until this type B job departs.


Problem 16. Each morning, as you pull out of your driveway, you would like to make
a U-turn rather than drive around the block. Unfortunately, U-turns are illegal in your
neighborhood, and police cars drive by according to a Poisson process with rate $\lambda$. You
decide to make a U-turn once you see that the road has been clear of police cars for $\tau$
time units. Let $N$ be the number of police cars you see before you make the U-turn.


(a) Find E[N]. 


(b) Find the conditional expectation of the time elapsed between police cars $n - 1$
and n, given that N > n. 


(c) Find the expected time that you wait until you make the U-turn. Hint: Condition 
on N. 


Problem 17. A wombat in the San Diego zoo spends the day walking from a burrow 
to a food tray, eating, walking back to the burrow, resting, and repeating the cycle. 
The amount of time to walk from the burrow to the tray (and also from the tray to the 
burrow) is 20 secs. The amounts of time spent at the tray and resting are exponentially 
distributed with mean 30 secs. The wombat, with probability 1/3, will momentarily 
stand still (for a negligibly small time) during a walk to or from the tray, with all times 
being equally likely (and independent of what happened in the past). A photographer 
arrives at a random time and will take a picture at the first time the wombat will stand 
still. What is the expected value of the length of time the photographer has to wait to 
snap the wombat's picture? 


Problem 18.* Consider a Poisson process. Given that a single arrival occurred in a 
given interval $[0, t]$, show that the PDF of the arrival time is uniform over $[0, t]$.


Solution. Consider an interval $[a, b] \subset [0, t]$ of length $l = b - a$. Let $T$ be the time of
the first arrival, and let $A$ be the event that a single arrival occurred during $[0, t]$. We
have

$$P(T \in [a, b] \mid A) = \frac{P(T \in [a, b] \text{ and } A)}{P(A)}.$$

The numerator is equal to the probability $P(1, l)$ that the Poisson process has exactly
one arrival during the length $l$ interval $[a, b]$, times the probability $P(0, t - l)$ that the
process has zero arrivals during the set $[0, a) \cup (b, t]$, of total length $t - l$. Hence

$$P(T \in [a, b] \mid A) = \frac{P(1, l)\, P(0, t - l)}{P(1, t)} = \frac{(\lambda l)e^{-\lambda l}\, e^{-\lambda (t - l)}}{(\lambda t)e^{-\lambda t}} = \frac{l}{t},$$

which establishes that $T$ is uniformly distributed.

Problem 19.* 


(a) Let $X_1$ and $X_2$ be independent and exponentially distributed, with parameters
$\lambda_1$ and $\lambda_2$, respectively. Find the expected value of $\max\{X_1, X_2\}$.


(b) Let $Y$ be exponentially distributed with parameter $\lambda_1$. Let $Z$ be Erlang of order
2 with parameter $\lambda_2$. Assume that $Y$ and $Z$ are independent. Find the expected
value of $\max\{Y, Z\}$.


Solution. A direct but tedious approach would be to find the PDF of the random 
variable of interest and then evaluate an integral to find its expectation. A much
simpler solution is obtained by interpreting the random variables of interest in terms
of underlying Poisson processes.


(a) Consider two independent Poisson processes with rates $\lambda_1$ and $\lambda_2$, respectively. We
interpret $X_1$ as the first arrival time in the first process, and $X_2$ as the first arrival
time in the second process. Let $T = \min\{X_1, X_2\}$ be the first time when one of the
processes registers an arrival. Let $S = \max\{X_1, X_2\} - T$ be the additional time until
both have registered an arrival. Since the merged process is Poisson with rate $\lambda_1 + \lambda_2$,
we have

$$E[T] = \frac{1}{\lambda_1 + \lambda_2}.$$

Concerning $S$, there are two cases to consider.


(i) The first arrival comes from the first process; this happens with probability
$\lambda_1/(\lambda_1 + \lambda_2)$. We then have to wait for an arrival from the second process,
which takes $1/\lambda_2$ time on the average.


(ii) The first arrival comes from the second process; this happens with probability
$\lambda_2/(\lambda_1 + \lambda_2)$. We then have to wait for an arrival from the first process, which
takes $1/\lambda_1$ time on the average.

Putting everything together, we obtain

$$E\big[\max\{X_1, X_2\}\big] = \frac{1}{\lambda_1 + \lambda_2} + \frac{\lambda_1}{\lambda_1 + \lambda_2} \cdot \frac{1}{\lambda_2} + \frac{\lambda_2}{\lambda_1 + \lambda_2} \cdot \frac{1}{\lambda_1} = \frac{1}{\lambda_1 + \lambda_2} \left( 1 + \frac{\lambda_1}{\lambda_2} + \frac{\lambda_2}{\lambda_1} \right).$$
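As a sanity check on the formula for the expected maximum of two independent exponentials, the following sketch compares it with a Monte Carlo estimate. The function names, the rates 1 and 2, the number of trials, and the seed are illustrative choices, not part of the problem.

```python
import random


def expected_max_formula(lam1, lam2):
    """Closed form from the merged-process argument: E[T] + E[S]."""
    return (1.0 / (lam1 + lam2)) * (1.0 + lam1 / lam2 + lam2 / lam1)


def expected_max_simulated(lam1, lam2, trials=200_000, seed=0):
    """Monte Carlo estimate of E[max{X1, X2}] (illustrative parameters)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        # expovariate takes the rate (1 / mean) as its argument.
        total += max(rng.expovariate(lam1), rng.expovariate(lam2))
    return total / trials


if __name__ == "__main__":
    formula = expected_max_formula(1.0, 2.0)
    estimate = expected_max_simulated(1.0, 2.0)
    print(f"formula = {formula:.4f}, simulation = {estimate:.4f}")
```

The two numbers should agree up to sampling error, which shrinks as the number of trials grows.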





(b) Consider two independent Poisson processes with rates $\lambda_1$ and $\lambda_2$, respectively.
We interpret $Y$ as the first arrival time in the first process, and $Z$ as the second
arrival time in the second process. Let $T$ be the first time when one of the processes
registers an arrival. Since the merged process is Poisson with rate $\lambda_1 + \lambda_2$, we have
$E[T] = 1/(\lambda_1 + \lambda_2)$. There are two cases to consider.


(i) The arrival at time $T$ comes from the first process; this happens with probability
$\lambda_1/(\lambda_1 + \lambda_2)$. In this case, we have to wait an additional time until the second
process registers two arrivals. This additional time is Erlang of order 2, with
parameter $\lambda_2$, and its expected value is $2/\lambda_2$.


(ii) The arrival at time $T$ comes from the second process; this happens with
probability $\lambda_2/(\lambda_1 + \lambda_2)$. In this case, the additional time $S$ we have to wait is the
time until each of the two processes registers an arrival. This is the maximum
of two independent exponential random variables and, according to the result of
part (a), we have

$$E[S] = \frac{1}{\lambda_1 + \lambda_2} \left( 1 + \frac{\lambda_1}{\lambda_2} + \frac{\lambda_2}{\lambda_1} \right).$$

Putting everything together, we have

$$E\big[\max\{Y, Z\}\big] = \frac{1}{\lambda_1 + \lambda_2} + \frac{\lambda_1}{\lambda_1 + \lambda_2} \cdot \frac{2}{\lambda_2} + \frac{\lambda_2}{\lambda_1 + \lambda_2} \cdot E[S],$$








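where $E[S]$ is given by the previous formula.


Problem 20.* Let $Y_k$ be the time of the $k$th arrival in a Poisson process with rate
$\lambda$. Show that for all $y > 0$,

$$\sum_{k=1}^{\infty} f_{Y_k}(y) = \lambda.$$

Solution. We have

$$\sum_{k=1}^{\infty} f_{Y_k}(y) = \sum_{k=1}^{\infty} \frac{\lambda^k y^{k-1} e^{-\lambda y}}{(k-1)!} = \lambda \sum_{m=0}^{\infty} \frac{\lambda^m y^m e^{-\lambda y}}{m!} = \lambda,$$

where we have let $m = k - 1$. The last equality holds because the $\lambda^m y^m e^{-\lambda y}/m!$ terms are the values of a Poisson
PMF with parameter $\lambda y$ and must therefore sum to 1.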

For a more intuitive derivation, let $\delta$ be a small positive number and consider the
following events:

$A_k$: the $k$th arrival occurs between $y$ and $y + \delta$; the probability of this event is
$P(A_k) \approx f_{Y_k}(y)\delta$;

$A$: an arrival occurs between $y$ and $y + \delta$; the probability of this event is
$P(A) \approx \lambda\delta$.

Suppose that $\delta$ is taken small enough so that the possibility of two or more arrivals
during an interval of length $\delta$ can be ignored. With this approximation, the events
$A_1, A_2, \ldots$ become disjoint, and their union is $A$. Therefore,

$$\sum_{k=1}^{\infty} f_{Y_k}(y) \cdot \delta \approx \sum_{k=1}^{\infty} P(A_k) \approx P(A) \approx \lambda\delta,$$

and the desired result follows by canceling $\delta$ from both sides.


Problem 21.* Consider an experiment involving two independent Poisson processes
with rates $\lambda_1$ and $\lambda_2$. Let $X_1(k)$ and $X_2(k)$ be the times of the $k$th arrival in the 1st
and the 2nd process, respectively. Show that

$$P\big(X_1(n) < X_2(m)\big) = \sum_{k=n}^{n+m-1} \binom{n+m-1}{k} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{k} \left( \frac{\lambda_2}{\lambda_1 + \lambda_2} \right)^{n+m-1-k}.$$

Solution. Consider the merged Poisson process, which has rate $\lambda_1 + \lambda_2$. Each time
there is an arrival in the merged process, this arrival comes from the first process
("success") with probability $\lambda_1/(\lambda_1 + \lambda_2)$, and from the second process ("failure") with
probability $\lambda_2/(\lambda_1 + \lambda_2)$. Consider the situation after $n + m - 1$ arrivals. The number
of arrivals from the first process is at least $n$ if and only if the number of arrivals from
the second process is less than $m$, which happens if and only if the $n$th success occurs
before the $m$th failure. Thus, the event $\{X_1(n) < X_2(m)\}$ is the same as the event of
having at least $n$ successes in the first $n + m - 1$ trials. Therefore, using the binomial
PMF for the number of successes in a given number of trials, we have

$$P\big(X_1(n) < X_2(m)\big) = \sum_{k=n}^{n+m-1} \binom{n+m-1}{k} \left( \frac{\lambda_1}{\lambda_1 + \lambda_2} \right)^{k} \left( \frac{\lambda_2}{\lambda_1 + \lambda_2} \right)^{n+m-1-k}.$$
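The binomial formula can be checked numerically. The sketch below evaluates the sum and compares it with a simulation in which the $n$th and $m$th arrival times are drawn directly as Erlang (gamma) random variables. The function names and the particular values $n = 2$, $m = 3$, $\lambda_1 = 1$, $\lambda_2 = 2$ are assumptions made only for illustration.

```python
import random
from math import comb


def prob_nth_before_mth(n, m, lam1, lam2):
    """P(X1(n) < X2(m)): at least n successes in n + m - 1 Bernoulli trials."""
    p = lam1 / (lam1 + lam2)          # probability a merged arrival is a "success"
    trials = n + m - 1
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(n, trials + 1))


def simulate(n, m, lam1, lam2, runs=100_000, seed=1):
    """Estimate the same probability from simulated kth arrival times."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(runs):
        # The kth arrival time is Erlang of order k; gammavariate takes
        # (shape, scale), so the scale is 1 / rate.
        x1 = rng.gammavariate(n, 1.0 / lam1)
        x2 = rng.gammavariate(m, 1.0 / lam2)
        hits += x1 < x2
    return hits / runs


if __name__ == "__main__":
    exact = prob_nth_before_mth(2, 3, 1.0, 2.0)
    approx = simulate(2, 3, 1.0, 2.0)
    print(f"binomial sum = {exact:.4f}, simulation = {approx:.4f}")
```

For these values the success probability is 1/3 and the sum equals 33/81, which the simulation should reproduce up to sampling error.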


Problem 22.* Sum of a random number of independent Bernoulli random
variables. Let $N, X_1, X_2, \ldots$ be independent random variables, where $N$ takes
nonnegative integer values, and $X_1, X_2, \ldots$ are Bernoulli with parameter $p$. Let
$Y = X_1 + \cdots + X_N$ for positive values of $N$ and let $Y = 0$ when $N = 0$.


(a) Show that if $N$ is binomial with parameters $m$ and $q$, then $Y$ is binomial with
parameters $m$ and $pq$.


(b) Show that if $N$ is Poisson with parameter $\lambda$, then $Y$ is Poisson with parameter
$\lambda p$.


Solution. (a) Consider splitting the Bernoulli process $X_1, X_2, \ldots$ by keeping successes
with probability $q$ and discarding them with probability $1 - q$. Then $Y$ represents the
number of successes in the split process during the first $m$ trials. Since the split process
is Bernoulli with parameter $pq$, it follows that $Y$ is binomial with parameters $m$ and
$pq$.

(b) Consider splitting a Poisson process with parameter $\lambda$ by keeping arrivals with
probability $p$ and discarding them with probability $1 - p$. Then $Y$ represents the
number of arrivals in the split process during a unit interval. Since the split process is
Poisson with parameter $\lambda p$, it follows that $Y$ is Poisson with parameter $\lambda p$.


Problem 23.* Sum of a geometric number of independent exponential random
variables. Let $Y = X_1 + \cdots + X_N$, where the random variables $X_i$ are exponential
with parameter $\lambda$, and $N$ is geometric with parameter $p$. Assume that the random
variables $N, X_1, X_2, \ldots$ are independent. Show, without using transforms, that $Y$ is
exponential with parameter $\lambda p$. Hint: Interpret the various random variables in terms
of a split Poisson process.


Solution. We derived this result in Chapter 4, using transforms, but we develop a
more intuitive derivation here. We interpret the random variables $X_i$ and $N$ as follows.
We view the times $X_1$, $X_1 + X_2$, etc. as the arrival times in a Poisson process with
parameter $\lambda$. Each arrival is rejected with probability $1 - p$ and is accepted with
probability $p$. We interpret $N$ as the number of arrivals until the first acceptance. The
process of accepted arrivals is obtained by splitting a Poisson process and is therefore
itself Poisson with parameter $\lambda p$. Note that $Y = X_1 + \cdots + X_N$ is the time of the first
accepted arrival and is therefore exponential with parameter $\lambda p$.
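The splitting interpretation translates directly into a simulation: keep adding exponential interarrival times until an arrival is accepted. The sketch below, with illustrative values $\lambda = 2$ and $p = 0.25$ and hypothetical function names, compares the sample mean and one tail probability of $Y$ with those of an exponential with parameter $\lambda p$.

```python
import math
import random


def sample_geometric_sum(lam, p, rng):
    """Time of the first accepted arrival in a split Poisson process."""
    total = rng.expovariate(lam)       # X_1: there is always at least one arrival
    while rng.random() >= p:           # this arrival is rejected (probability 1 - p)
        total += rng.expovariate(lam)  # wait for the next arrival
    return total


def summarize(lam, p, t, runs=100_000, seed=2):
    """Return (sample mean of Y, empirical P(Y > t))."""
    rng = random.Random(seed)
    samples = [sample_geometric_sum(lam, p, rng) for _ in range(runs)]
    mean = sum(samples) / runs
    tail = sum(y > t for y in samples) / runs
    return mean, tail


if __name__ == "__main__":
    lam, p, t = 2.0, 0.25, 2.0
    mean, tail = summarize(lam, p, t)
    print(f"mean: {mean:.3f} vs 1/(lam*p) = {1 / (lam * p):.3f}")
    print(f"P(Y > {t}): {tail:.3f} vs exp(-lam*p*t) = {math.exp(-lam * p * t):.3f}")
```

Matching a mean and one tail value does not prove the distributional claim, but it is a quick check that the interpretation of $N$ and the $X_i$ has been coded consistently.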


Problem 24.* The number of Poisson arrivals during an exponentially
distributed interval. Consider a Poisson process with parameter $\lambda$, and an independent
random variable $T$, which is exponential with parameter $\nu$. Find the PMF of the
number of Poisson arrivals during the time interval $[0, T]$.


Solution. Let us view $T$ as the first arrival time in a new, independent, Poisson process
with parameter $\nu$, and merge this process with the original Poisson process. Each
arrival in the merged process comes from the original Poisson process with probability
$\lambda/(\lambda + \nu)$, independent of other arrivals. If we view each arrival in the merged process
as a trial, and an arrival from the new process as a success, we note that the number
$K$ of trials/arrivals until the first success has a geometric PMF, of the form

$$p_K(k) = \left( \frac{\nu}{\lambda + \nu} \right) \left( \frac{\lambda}{\lambda + \nu} \right)^{k-1}, \qquad k = 1, 2, \ldots.$$

Now the number $L$ of arrivals from the original Poisson process until the first "success"
is equal to $K - 1$ and its PMF is

$$p_L(l) = p_K(l + 1) = \left( \frac{\nu}{\lambda + \nu} \right) \left( \frac{\lambda}{\lambda + \nu} \right)^{l}, \qquad l = 0, 1, \ldots.$$
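The PMF of $L$ can be compared with a direct simulation that draws $T$ and then counts Poisson arrivals in $[0, T]$. This is a minimal sketch; the function names and the values $\lambda = 3$, $\nu = 1$ are illustrative assumptions.

```python
import random
from collections import Counter


def pmf_L(l, lam, nu):
    """PMF of the number of arrivals in [0, T], from the merging argument."""
    return (nu / (lam + nu)) * (lam / (lam + nu)) ** l


def count_arrivals(lam, nu, rng):
    """Draw T ~ exponential(nu), then count rate-lam Poisson arrivals in [0, T]."""
    horizon = rng.expovariate(nu)
    t = rng.expovariate(lam)           # time of the first Poisson arrival
    count = 0
    while t <= horizon:
        count += 1
        t += rng.expovariate(lam)      # next interarrival time
    return count


def empirical_pmf(lam, nu, runs=100_000, seed=3):
    rng = random.Random(seed)
    counts = Counter(count_arrivals(lam, nu, rng) for _ in range(runs))
    return {l: c / runs for l, c in counts.items()}


if __name__ == "__main__":
    lam, nu = 3.0, 1.0
    emp = empirical_pmf(lam, nu)
    for l in range(4):
        print(f"l = {l}: formula {pmf_L(l, lam, nu):.4f}, simulation {emp.get(l, 0.0):.4f}")
```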


Problem 25.* An infinite server queue. We consider a queueing system with an
infinite number of servers, in which customers arrive according to a Poisson process with
rate $\lambda$. The $i$th customer stays in the system for a random amount of time, denoted by
$X_i$. We assume that the random variables $X_i$ are independent identically distributed,
and also independent from the arrival process. We also assume, for simplicity, that the
$X_i$ take integer values in the range $1, 2, \ldots, n$, with given probabilities. Find the PMF
of $N_t$, the number of customers in the system at time $t$.


Solution. Let us refer to those customers $i$ whose service time $X_i$ is equal to $k$ as
"type-$k$" customers. We view the overall arrival process as the merging of $n$ Poisson
subprocesses; the $k$th subprocess corresponds to arrivals of type-$k$ customers, is
independent of the other arrival subprocesses, and has rate $\lambda p_k$, where $p_k = P(X_i = k)$.
Let $N_t^k$ be the number of type-$k$ customers in the system at time $t$. Thus,

$$N_t = \sum_{k=1}^{n} N_t^k,$$

and the random variables $N_t^k$ are independent.

We now determine the PMF of $N_t^k$. A type-$k$ customer is in the system at time $t$
if and only if that customer arrived between times $t - k$ and $t$. Thus, $N_t^k$ has a Poisson
PMF with mean $\lambda k p_k$. Since the sum of independent Poisson random variables is
Poisson, it follows that $N_t$ has a Poisson PMF with parameter

$$E[N_t] = \lambda \sum_{k=1}^{n} k p_k = \lambda E[X_i].$$


Problem 26.* Independence of Poisson processes obtained by splitting.
Consider a Poisson process whose arrivals are split, with each arrival assigned to one
of two subprocesses by tossing an independent coin with success probability $p$. In
Example 6.13, it was established that each of the subprocesses is a Poisson process.
Show that the two subprocesses are independent.


Solution. Let us start with two independent Poisson processes $P_1$ and $P_2$, with rates $p\lambda$
and $(1 - p)\lambda$, respectively. We merge the two processes and obtain a Poisson process
$P$ with rate $\lambda$. We now split the process $P$ into two new subprocesses $P_1'$ and $P_2'$,
according to the following rule: an arrival is assigned to subprocess $P_1'$ (respectively,
$P_2'$) if and only if that arrival corresponds to an arrival from subprocess $P_1$ (respectively,
$P_2$). Clearly, the two new subprocesses $P_1'$ and $P_2'$ are independent, since they are
identical to the original subprocesses $P_1$ and $P_2$. However, $P_1'$ and $P_2'$ were generated
by a splitting mechanism that looks different from the one in the problem statement.
We will now verify that the new splitting mechanism considered here is statistically
identical to the one in the problem statement. It will then follow that the subprocesses
constructed in the problem statement have the same statistical properties as $P_1'$ and
$P_2'$, and are also independent.

So, let us consider the above described splitting mechanism. Given that $P$ had
an arrival at time $t$, this was due to either an arrival in $P_1$ (with probability $p$), or to
an arrival in $P_2$ (probability $1 - p$). Therefore, the arrival to $P$ is assigned to $P_1'$ or
$P_2'$ with probabilities $p$ and $1 - p$, respectively, exactly as in the splitting procedure
described in the problem statement. Consider now the $k$th arrival in $P$ and let $L_k$
be the event that this arrival originated from subprocess $P_1$; this is the same as the
event that the $k$th arrival is assigned to subprocess $P_1'$. As explained in the context
of Example 6.14, the events $L_k$ are independent. Thus, the assignments of arrivals to
the subprocesses $P_1'$ and $P_2'$ are independent for different arrivals, which is the other
requirement of the splitting mechanism described in the problem statement.


Problem 27.* Random incidence in an Erlang arrival process. Consider an
arrival process in which the interarrival times are independent Erlang random variables
of order 2, with mean $2/\lambda$. Assume that the arrival process has been ongoing for a very
long time. An external observer arrives at a given time $t$. Find the PDF of the length
of the interarrival interval that contains $t$.


Solution. We view the Erlang arrival process in the problem statement as part of a
Poisson process with rate $\lambda$. In particular, the Erlang arrival process registers an arrival
once every two arrivals of the Poisson process. For concreteness, let us say that the
Erlang process arrivals correspond to even-numbered arrivals in the Poisson process.
Let $Y_k$ be the time of the $k$th arrival in the Poisson process.

Let $K$ be such that $Y_K < t < Y_{K+1}$. By the discussion of random incidence
in Poisson processes in the text, we have that $Y_{K+1} - Y_K$ is Erlang of order 2. The
interarrival interval for the Erlang process considered in this problem is of the form
$[Y_K, Y_{K+2}]$ or $[Y_{K-1}, Y_{K+1}]$, depending on whether $K$ is even or odd, respectively. In
the first case, the length of the interarrival interval in the Erlang process is of the form
$(Y_{K+1} - Y_K) + (Y_{K+2} - Y_{K+1})$. We claim that $Y_{K+2} - Y_{K+1}$ is exponential with parameter $\lambda$
and independent of $Y_{K+1} - Y_K$. Indeed, an observer who arrives at time $t$ and notices
that $K$ is even, must first wait until the time $Y_{K+1}$ of the next Poisson arrival. At
that time, the Poisson process starts afresh, and the time $Y_{K+2} - Y_{K+1}$ until the next
Poisson arrival is independent of the past (hence, independent of $Y_{K+1} - Y_K$) and
has an exponential distribution with parameter $\lambda$, as claimed. This establishes that,
conditioned on $K$ being even, the interarrival interval length $Y_{K+2} - Y_K$ of the Erlang
process is Erlang of order 3 (since it is the sum of an exponential random variable
and a random variable which is Erlang of order 2). By a symmetrical argument, if
we condition on $K$ being odd, the conditional PDF of the interarrival interval length
$Y_{K+1} - Y_{K-1}$ of the Erlang process is again the same. Since the conditional PDF of
the length of the interarrival interval that contains $t$ is Erlang of order 3, for every
conditioning event, it follows that the unconditional PDF is also Erlang of order 3.


The Bernoulli and Poisson processes studied in the preceding chapter are memoryless,
in the sense that the future does not depend on the past: the occurrences
of new "successes" or "arrivals" do not depend on the past history of the process.
In this chapter, we consider processes where the future depends on and can be
predicted to some extent by what has happened in the past.

We emphasize models where the effect of the past on the future is summarized
by a state, which changes over time according to given probabilities.
We restrict ourselves to models in which the state can only take a finite number
of values and can change according to probabilities that do not depend on the time
of the change. We want to analyze the probabilistic properties of the sequence
of state values.

The range of applications of the type of models described in this chapter
is truly vast. It includes just about any dynamical system whose evolution over
time involves uncertainty, provided the state of the system is suitably defined.
Such systems arise in a broad variety of fields, such as, for example, communications,
automatic control, signal processing, manufacturing, economics, and
operations research.


DISCRETE-TIME MARKOV CHAINS 


We will first consider discrete-time Markov chains, in which the state changes
at certain discrete time instants, indexed by an integer variable $n$. At each time
step $n$, the state of the chain is denoted by $X_n$, and belongs to a finite set $S$ of
possible states, called the state space. Without loss of generality, and unless
there is a statement to the contrary, we will assume that $S = \{1, \ldots, m\}$, for some
positive integer $m$. The Markov chain is described in terms of its transition
probabilities $p_{ij}$: whenever the state happens to be $i$, there is probability $p_{ij}$
that the next state is equal to $j$. Mathematically,

$$p_{ij} = P(X_{n+1} = j \mid X_n = i), \qquad i, j \in S.$$

The key assumption underlying Markov chains is that the transition probabilities
$p_{ij}$ apply whenever state $i$ is visited, no matter what happened in the past, and
no matter how state $i$ was reached. Mathematically, we assume the Markov
property, which requires that

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = P(X_{n+1} = j \mid X_n = i) = p_{ij},$$

for all times $n$, all states $i, j \in S$, and all possible sequences $i_0, \ldots, i_{n-1}$ of earlier
states. Thus, the probability law of the next state $X_{n+1}$ depends on the past
only through the value of the present state $X_n$.


The transition probabilities $p_{ij}$ must of course be nonnegative, and sum to
one:

$$\sum_{j=1}^{m} p_{ij} = 1, \qquad \text{for all } i.$$

We will generally allow the probabilities $p_{ii}$ to be positive, in which case it is
possible for the next state to be the same as the current one. Even though the
state does not change, we still view this as a state transition of a special type (a
"self-transition").


Specification of Markov Models

• A Markov chain model is specified by identifying:

(a) the set of states $S = \{1, \ldots, m\}$,

(b) the set of possible transitions, namely, those pairs $(i, j)$ for which
$p_{ij} > 0$, and,

(c) the numerical values of those $p_{ij}$ that are positive.

• The Markov chain specified by this model is a sequence of random
variables $X_0, X_1, X_2, \ldots$, that take values in $S$, and which satisfy

$$P(X_{n+1} = j \mid X_n = i, X_{n-1} = i_{n-1}, \ldots, X_0 = i_0) = p_{ij},$$

for all times $n$, all states $i, j \in S$, and all possible sequences $i_0, \ldots, i_{n-1}$
of earlier states.





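All of the elements of a Markov chain model can be encoded in a transition
probability matrix, which is simply a two-dimensional array whose element
at the $i$th row and $j$th column is $p_{ij}$:

$$\begin{bmatrix} p_{11} & p_{12} & \cdots & p_{1m} \\ p_{21} & p_{22} & \cdots & p_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ p_{m1} & p_{m2} & \cdots & p_{mm} \end{bmatrix}.$$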


It is also helpful to lay out the model in the so-called transition probability 
graph, whose nodes are the states and whose arcs are the possible transitions. 
By recording the numerical values of pij near the corresponding arcs, one can 
visualize the entire model in a way that can make some of its major properties 
readily apparent. 


Example 7.1. Alice is taking a probability class and in each week, she can be 
either up-to-date or she may have fallen behind. If she is up-to-date in a given 
week, the probability that she will be up-to-date (or behind) in the next week is 
0.8 (or 0.2, respectively). If she is behind in the given week, the probability that 
she will be up-to-date (or behind) in the next week is 0.6 (or 0.4, respectively). We 
assume that these probabilities do not depend on whether she was up-to-date or 
behind in previous weeks, so the problem has the typical Markov chain character 
(the future depends on the past only through the present).

Let us introduce states 1 and 2, and identify them with being up-to-date and
behind, respectively. Then, the transition probabilities are

$$p_{11} = 0.8, \qquad p_{12} = 0.2, \qquad p_{21} = 0.6, \qquad p_{22} = 0.4,$$

and the transition probability matrix is

$$\begin{bmatrix} 0.8 & 0.2 \\ 0.6 & 0.4 \end{bmatrix}.$$


The transition probability graph is shown in Fig. 7.1. 


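Figure 7.1: The transition probability graph in Example 7.1. (Nodes: up-to-date and behind; arc probabilities: up-to-date to up-to-date 0.8, up-to-date to behind 0.2, behind to up-to-date 0.6, behind to behind 0.4.)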
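The three ingredients of a Markov chain model (states, possible transitions, and the positive transition probabilities) can be written down concretely for Example 7.1. The sketch below stores the transition probability matrix as a nested list and checks that each row is a valid probability distribution; the helper names are hypothetical and chosen only for this illustration.

```python
# States of Example 7.1: 1 = up-to-date, 2 = behind.
STATES = {1: "up-to-date", 2: "behind"}

# Row i holds p_i1, p_i2 (Python indices are shifted down by one).
P = [
    [0.8, 0.2],
    [0.6, 0.4],
]


def is_transition_matrix(matrix, tol=1e-12):
    """True if the matrix is square, nonnegative, and every row sums to one."""
    m = len(matrix)
    return all(
        len(row) == m
        and all(p >= 0 for p in row)
        and abs(sum(row) - 1.0) <= tol
        for row in matrix
    )


def possible_transitions(matrix):
    """Pairs (i, j), with 1-based state labels, for which p_ij > 0."""
    return [
        (i + 1, j + 1)
        for i, row in enumerate(matrix)
        for j, p in enumerate(row)
        if p > 0
    ]


if __name__ == "__main__":
    print("valid model:", is_transition_matrix(P))
    for i, j in possible_transitions(P):
        print(f"{STATES[i]} -> {STATES[j]}: {P[i - 1][j - 1]}")
```

The list of possible transitions is exactly the set of arcs of the transition probability graph.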


Example 7.2. Spiders and Fly. A fly moves along a straight line in unit 
increments. At each time period, it moves one unit to the left with probability 
0.3, one unit to the right with probability 0.3, and stays in place with probability 
0.4, independent of the past history of movements. Two spiders are lurking at 
positions 1 and m: if the fly lands there, it is captured by a spider, and the process 
terminates. We want to construct a Markov chain model, assuming that the fly 
starts in a position between 1 and m. 

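Let us introduce states $1, 2, \ldots, m$, and identify them with the corresponding
positions of the fly. The nonzero transition probabilities are

$$p_{11} = 1, \qquad p_{mm} = 1,$$

$$p_{ij} = \begin{cases} 0.3, & \text{if } j = i - 1 \text{ or } j = i + 1, \\ 0.4, & \text{if } j = i, \end{cases} \qquad \text{for } i = 2, \ldots, m - 1.$$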
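The transition probabilities just listed determine the whole matrix for any number of positions. As a minimal sketch (the function name is an assumption for this illustration), the matrix can be built row by row, with states 1 and $m$ keeping all of their probability on themselves.

```python
def spider_fly_matrix(m):
    """Transition probability matrix of the spiders-and-fly chain with m states.

    Entry [i - 1][j - 1] holds p_ij, since Python lists are 0-indexed.
    """
    if m < 2:
        raise ValueError("need at least the two spider positions")
    P = [[0.0] * m for _ in range(m)]
    P[0][0] = 1.0                # p_11 = 1: fly captured at position 1
    P[m - 1][m - 1] = 1.0        # p_mm = 1: fly captured at position m
    for i in range(1, m - 1):    # interior states 2, ..., m - 1
        P[i][i - 1] = 0.3        # one unit to the left
        P[i][i] = 0.4            # stay in place
        P[i][i + 1] = 0.3        # one unit to the right
    return P


if __name__ == "__main__":
    for row in spider_fly_matrix(4):
        print(row)
```

Calling `spider_fly_matrix(4)` gives the four-state case, in which states 2 and 3 are the only positions from which the fly can still move.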


The transition probability graph and matrix are shown in Fig. 7.2.


Figure 7.2: The transition probability graph and the transition probability matrix
in Example 7.2, for the case where $m = 4$.


Example 7.3. Machine Failure, Repair, and Replacement. A machine can
be either working or broken down on a given day. If it is working, it will break
down in the next day with probability $b$, and will continue working with probability
$1 - b$. If it breaks down on a given day, it will be repaired and be working in the
next day with probability $r$, and will continue to be broken down with probability
$1 - r$.

We model the machine by a Markov chain with the following two states: 


State 1: Machine is working, State 2: Machine is broken down. 


The transition probability graph of the chain is given in Fig. 7.3. The transition
probability matrix is

$$\begin{bmatrix} 1 - b & b \\ r & 1 - r \end{bmatrix}.$$


Figure 7.3: Transition probability graph for Example 7.3. 


The situation considered here evidently has the Markov property: the state of 
the machine at the next day depends explicitly only on its state at the present day. 
However, it is possible to use a Markov chain model even if there is a dependence 
on the states at several past days. The general idea is to introduce some additional 
states which encode relevant information from preceding periods, as in the variation 
that we consider next. 

Suppose that whenever the machine remains broken for a given number of
$\ell$ days, despite the repair efforts, it is replaced by a new working machine. To
model this as a Markov chain, we replace the single state 2, corresponding to a
broken down machine, with several states that indicate the number of days that
the machine is broken. These states are


State $(2, i)$: The machine has been broken for $i$ days, $i = 1, 2, \ldots, \ell$.


The transition probability graph is given in Fig. 7.4 for the case where $\ell = 4$.

The second half of the preceding example illustrates that in order to construct
a Markov model, there is often a need to introduce new states that capture
the dependence of the future on the model's past history. We note that there
is some freedom in selecting these additional states, but their number should be
generally kept small, for reasons of analytical or computational tractability.


Figure 7.4: Transition probability graph for the second part of Example 7.3. A
machine that has remained broken for $\ell = 4$ days is replaced by a new, working
machine.


The Probability of a Path


Given a Markov chain model, we can compute the probability of any particular
sequence of future states. This is analogous to the use of the multiplication rule
in sequential (tree) probability models. In particular, we have

$$P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) = P(X_0 = i_0)\, p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$

To verify this property, note that

$$\begin{aligned} P(X_0 = i_0, X_1 = i_1, \ldots, X_n = i_n) &= P(X_n = i_n \mid X_0 = i_0, \ldots, X_{n-1} = i_{n-1})\, P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}) \\ &= p_{i_{n-1} i_n}\, P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1}), \end{aligned}$$

where the last equality made use of the Markov property. We then apply the
same argument to the term $P(X_0 = i_0, \ldots, X_{n-1} = i_{n-1})$ and continue similarly,
until we eventually obtain the desired expression. If the initial state $X_0$ is given
and is known to be equal to some $i_0$, a similar argument yields

$$P(X_1 = i_1, \ldots, X_n = i_n \mid X_0 = i_0) = p_{i_0 i_1} p_{i_1 i_2} \cdots p_{i_{n-1} i_n}.$$


Graphically, a state sequence can be identified with a sequence of arcs in the 
transition probability graph, and the probability of such a path (given the ini- 
tial state) is given by the product of the probabilities associated with the arcs 
traversed by the path. 
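The product rule for paths is a one-line loop over consecutive pairs of states. The sketch below applies it to the spiders-and-fly chain of Example 7.2 with $m = 4$; the function name and the optional `initial_prob` argument (standing for $P(X_0 = i_0)$, with the default 1 meaning that we condition on the initial state) are assumptions of this illustration.

```python
# Spiders-and-fly chain of Example 7.2 with m = 4 (row i - 1 holds p_i1, ..., p_i4).
P = [
    [1.0, 0.0, 0.0, 0.0],
    [0.3, 0.4, 0.3, 0.0],
    [0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]


def path_probability(matrix, path, initial_prob=1.0):
    """Probability of the state sequence `path` (1-based state labels).

    Multiplies the arc probabilities p_{i_0 i_1} p_{i_1 i_2} ... along the path,
    scaled by initial_prob = P(X_0 = i_0).
    """
    prob = initial_prob
    for i, j in zip(path, path[1:]):
        prob *= matrix[i - 1][j - 1]
    return prob


if __name__ == "__main__":
    # Stay at 2 twice, then move right twice, given that the fly starts at 2.
    print(path_probability(P, [2, 2, 2, 3, 4]))
    # A path that uses an arc absent from the graph has probability zero.
    print(path_probability(P, [2, 1, 2]))
```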


Example 7.4. For the spider and fly example (Example 7.2), we have

$$P(X_1 = 2, X_2 = 2, X_3 = 3, X_4 = 4 \mid X_0 = 2) = p_{22}\, p_{22}\, p_{23}\, p_{34} = (0.4)^2 (0.3)^2.$$

We also have

$$P(X_0 = 2, X_1 = 2, X_2 = 2, X_3 = 3, X_4 = 4) = P(X_0 = 2)\, p_{22}\, p_{22}\, p_{23}\, p_{34} = P(X_0 = 2)(0.4)^2 (0.3)^2.$$

Note that in order to calculate a probability of this form, in which there is no
conditioning on a fixed initial state, we need to specify a probability law for the
initial state $X_0$.


Many Markov chain problems require the calculation of the probability law of
the state at some future time, conditioned on the current state. This probability
law is captured by the n-step transition probabilities, defined by

$$r_{ij}(n) = P(X_n = j \mid X_0 = i).$$

In words, $r_{ij}(n)$ is the probability that the state after $n$ time periods will be $j$,
given that the current state is $i$. It can be calculated using the following basic
recursion, known as the Chapman-Kolmogorov equation.


Chapman-Kolmogorov Equation for the n-Step Transition
Probabilities


The n-step transition probabilities can be generated by the recursive formula

$$r_{ij}(n) = \sum_{k=1}^{m} r_{ik}(n - 1)\, p_{kj}, \qquad \text{for } n > 1, \text{ and all } i, j,$$

starting with

$$r_{ij}(1) = p_{ij}.$$





To verify the formula, we apply the total probability theorem as follows:

$$\begin{aligned} P(X_n = j \mid X_0 = i) &= \sum_{k=1}^{m} P(X_{n-1} = k \mid X_0 = i)\, P(X_n = j \mid X_{n-1} = k, X_0 = i) \\ &= \sum_{k=1}^{m} r_{ik}(n - 1)\, p_{kj}; \end{aligned}$$

see Fig. 7.5 for an illustration. We have used here the Markov property: once
we condition on $X_{n-1} = k$, the conditioning on $X_0 = i$ does not affect the
probability $p_{kj}$ of reaching $j$ at the next step.
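The recursion can be carried out mechanically. The following sketch computes $r_{ij}(n)$ for the two-state chain of Example 7.1 by applying the Chapman-Kolmogorov equation $n - 1$ times, starting from $r_{ij}(1) = p_{ij}$; the function name `n_step` is a hypothetical choice for this illustration.

```python
# Transition probability matrix of Example 7.1 (1 = up-to-date, 2 = behind).
P = [
    [0.8, 0.2],
    [0.6, 0.4],
]


def n_step(matrix, n):
    """n-step transition probabilities r_ij(n) via the Chapman-Kolmogorov recursion."""
    if n < 1:
        raise ValueError("n must be at least 1")
    m = len(matrix)
    r = [row[:] for row in matrix]            # r_ij(1) = p_ij
    for _ in range(n - 1):
        # r_ij(next) = sum over k of r_ik(current) * p_kj
        r = [
            [sum(r[i][k] * matrix[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)
        ]
    return r


if __name__ == "__main__":
    for n in (1, 2, 5, 20):
        rows = n_step(P, n)
        print(n, [[round(x, 4) for x in row] for row in rows])
```

For this particular chain, the two rows of the output become nearly identical as $n$ grows (both approach 0.75 and 0.25), so the dependence on the initial state fades.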

We can view r;j(n) as the element at the ith row and jth column of a two- 


dimensional array, called the n-step transition probability matrix. Fig- 
ures 7.6 and 7.7 give the n-step transition probabilities r;;(n) for the cases of 


i Those readers familiar with matrix multiplication, may recognize that the 
Chapman-Kolmogorov equation can be expressed as follows: the matrix of n-step tran- 
sition probabilities r,,(n) is obtained by multiplying the matrix of (n — 1)-step tran- 
sition probabilities ri&(n — 1), with the one-step transition probability matrix. Thus, 
the n-step transition probability matrix is the nth power of the transition probability 
matrix. 


7.2 
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Figure 7.5: Derivation of the Chapman-Kolmogorov equation. The probability 
of being at state j at time n is the sum of the probabilities r;;(n — 1)p,; of the 
different ways of reaching j. 


Examples 7.1 and 7.2, respectively. There are some interesting observations 
about the limiting behavior of ri;(n) in these two examples. In Fig. 7.6, we see 
that each ri;(n) converges to a limit, as n — oo, and this limit does not depend 
on the initial state i. Thus, each state has a positive "steady-state" probabil- 
ity of being occupied at times far into the future. Furthermore, the probability 
ri;(n) depends on the initial state i when n is small, but over time this depen- 
dence diminishes. Many (but by no means all) probabilistic models that evolve 
over time have such a character: after a sufficiently long time, the effect of their 
initial condition becomes negligible. 
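The Chapman-Kolmogorov recursion is easy to iterate numerically. The following minimal sketch (the function name `n_step` is ours) applies it to the two-state chain of Example 7.1, assuming the transition probabilities p11 = 0.8, p12 = 0.2, p21 = 0.6, p22 = 0.4 that the text gives for that example.

```python
def n_step(P, n):
    """Return the matrix of n-step transition probabilities r_ij(n), n >= 1."""
    m = len(P)
    r = [row[:] for row in P]  # r_ij(1) = p_ij
    for _ in range(n - 1):
        # Chapman-Kolmogorov: r_ij(n) = sum over k of r_ik(n-1) * p_kj
        r = [[sum(r[i][k] * P[k][j] for k in range(m)) for j in range(m)]
             for i in range(m)]
    return r

# Two-state chain of Example 7.1 (states indexed 0 and 1 here).
P = [[0.8, 0.2],
     [0.6, 0.4]]

r2 = n_step(P, 2)
r20 = n_step(P, 20)
```

After 20 steps the two rows of `r20` agree to many decimal places, which is the loss of dependence on the initial state described above.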

In Fig. 7.7, we see a qualitatively different behavior: $r_{ij}(n)$ again converges,
but the limit depends on the initial state, and can be zero for selected states.
Here, we have two states that are "absorbing," in the sense that they are infinitely
repeated, once reached. These are the states 1 and 4 that correspond to the
capture of the fly by one of the two spiders. Given enough time, it is certain that
some absorbing state will be reached. Accordingly, the probability of being at the
non-absorbing states 2 and 3 diminishes to zero as time increases. Furthermore,
the probability that a particular absorbing state will be reached depends on how
"close" we start to that state.

These examples illustrate that there is a variety of types of states and
asymptotic occupancy behavior in Markov chains. We are thus motivated to
classify and analyze the various possibilities, and this is the subject of the next
three sections.


7.2 CLASSIFICATION OF STATES


In the preceding section, we saw some examples indicating that the various states
of a Markov chain can have qualitatively different characteristics. In particular,
some states, after being visited once, are certain to be visited again, while for
some other states this may not be the case. In this section, we focus on the
mechanism by which this occurs. In particular, we wish to classify the states
of a Markov chain with a focus on the long-term frequency with which they are
visited.

Figure 7.6: n-step transition probabilities for the "up-to-date/behind"
Example 7.1. Observe that as n increases, $r_{ij}(n)$ converges to a limit that does not
depend on the initial state i.

As a first step, we make the notion of revisiting a state precise. Let us say
that a state j is accessible from a state i if for some n, the n-step transition
probability $r_{ij}(n)$ is positive, i.e., if there is positive probability of reaching j,
starting from i, after some number of time periods. An equivalent definition is
that there is a possible state sequence $i, i_1, \ldots, i_{n-1}, j$, that starts at i and ends
at j, in which the transitions $(i, i_1), (i_1, i_2), \ldots, (i_{n-2}, i_{n-1}), (i_{n-1}, j)$ all have
positive probability. Let A(i) be the set of states that are accessible from i. We
say that i is recurrent if for every j that is accessible from i, i is also accessible
from j; that is, for all j that belong to A(i) we have that i belongs to A(j).

When we start at a recurrent state i, we can only visit states $j \in A(i)$
from which i is accessible. Thus, from any future state, there is always some
probability of returning to i and, given enough time, this is certain to happen.
By repeating this argument, if a recurrent state is visited once, it is certain to
be revisited an infinite number of times. (See the end-of-chapter problems for a
more rigorous version of this argument.)

A state is called transient if it is not recurrent. Thus, a state i is transient
if there is a state $j \in A(i)$ such that i is not accessible from j. After each visit to
state i, there is positive probability that the state enters such a j. Given enough
time, this will happen, and state i cannot be visited after that. Thus, a transient
state will only be visited a finite number of times; see again the end-of-chapter
problems.

Figure 7.7: The top part of the figure shows the n-step transition probabilities
$r_{i1}(n)$ for the “spiders-and-fly” Example 7.2. These are the probabilities of
reaching state 1 by time n, starting from state i. We observe that these probabilities
converge to a limit, but the limit depends on the starting state. In this example,
note that the probabilities $r_{i2}(n)$ and $r_{i3}(n)$ of being at the non-absorbing states
2 or 3, go to zero, as n increases.

Note that transience or recurrence is determined by the arcs of the
transition probability graph [those pairs (i, j) for which $p_{ij} > 0$] and not by the
numerical values of the $p_{ij}$. Figure 7.8 provides an example of a transition
probability graph, and the corresponding recurrent and transient states.

If i is a recurrent state, the set of states A(i) that are accessible from i
form a recurrent class (or simply class), meaning that states in A(i) are all
accessible from each other, and no state outside A(i) is accessible from them.
Mathematically, for a recurrent state i, we have A(i) = A(j) for all j that belong
to A(i), as can be seen from the definition of recurrence. For example, in the
graph of Fig. 7.8, states 3 and 4 form a class, and state 1 by itself also forms a
class.

Figure 7.8: Classification of states given the transition probability graph.
Starting from state 1, the only accessible state is itself, and so 1 is a recurrent state.
States 1, 3, and 4 are accessible from 2, but 2 is not accessible from any of them,
so state 2 is transient. States 3 and 4 are accessible from each other, and they are
both recurrent.

It can be seen that at least one recurrent state must be accessible from any
given transient state. This is intuitively evident and is left as an end-of-chapter
problem. It follows that there must exist at least one recurrent state, and hence
at least one class. Thus, we reach the following conclusion.


Markov Chain Decomposition 





A Markov chain can be decomposed into one or more recurrent classes, 
plus possibly some transient states. 


• A recurrent state is accessible from all states in its class, but is not
accessible from recurrent states in other classes.


• A transient state is not accessible from any recurrent state.


• At least one, possibly more, recurrent states are accessible from a given
transient state.
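Since the classification depends only on the arcs of the transition probability graph, it can be computed by a plain graph search. The sketch below is one possible way to do it; the function names are ours, and the arc list is a hypothetical graph chosen to be consistent with the description of Fig. 7.8 (the figure's exact arcs are not reproduced in the text).

```python
def accessible_sets(arcs, states):
    """A(i): states reachable from i along one or more positive-probability arcs."""
    A = {}
    for i in states:
        seen, stack = set(), list(arcs.get(i, ()))
        while stack:
            j = stack.pop()
            if j not in seen:
                seen.add(j)
                stack.extend(arcs.get(j, ()))
        A[i] = seen
    return A

def classify(arcs, states):
    """Return (recurrent states, transient states, recurrent classes)."""
    A = accessible_sets(arcs, states)
    # i is recurrent if i is accessible from every j that is accessible from i.
    recurrent = {i for i in states if all(i in A[j] for j in A[i])}
    transient = set(states) - recurrent
    # For a recurrent state i, its class is A(i).
    classes = {frozenset(A[i]) for i in recurrent}
    return recurrent, transient, classes

# Hypothetical arcs (i -> list of j with p_ij > 0), assumed for illustration.
states = [1, 2, 3, 4]
arcs = {1: [1], 2: [1, 2, 3], 3: [4], 4: [3]}

recurrent, transient, classes = classify(arcs, states)
```

For this assumed graph, state 2 comes out transient and the recurrent classes are {1} and {3, 4}, in agreement with the decomposition stated above.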








Figure 7.9 provides examples of Markov chain decompositions. Decomposition
provides a powerful conceptual tool for reasoning about Markov chains
and visualizing the evolution of their state. In particular, we see that:


(a) once the state enters (or starts in) a class of recurrent states, it stays within
that class; since all states in the class are accessible from each other, all
states in the class will be visited an infinite number of times;


(b) if the initial state is transient, then the state trajectory contains an ini- 
tial portion consisting of transient states and a final portion consisting of 
recurrent states from the same class. 


For the purpose of understanding long-term behavior of Markov chains, it is
important to analyze chains that consist of a single recurrent class. For the purpose
of understanding short-term behavior, it is also important to analyze the
mechanism by which any particular class of recurrent states is entered starting from a
given transient state. These two issues, long-term and short-term behavior, are
the focus of Sections 7.3 and 7.4, respectively.

[Panel labels recovered from the figure: "Single class of recurrent states (1 and 2)
and one transient state (3)"; "Two classes of recurrent states (class of state 1 and
class of states 4 and 5) and two transient states (2 and 3)".]

Figure 7.9: Examples of Markov chain decompositions into recurrent classes and
transient states.


Periodicity 


There is another important characterization of a recurrent class, which relates
to the presence or absence of a certain periodic pattern in the times that a state
can be visited. In particular, a recurrent class is said to be periodic if its states
can be grouped in d > 1 disjoint subsets $S_1, \ldots, S_d$ so that all transitions from
one subset lead to the next subset; see Fig. 7.10. More precisely,

$$\text{if } i \in S_k \text{ and } p_{ij} > 0, \text{ then } \begin{cases} j \in S_{k+1}, & \text{if } k = 1, \ldots, d-1, \\ j \in S_1, & \text{if } k = d. \end{cases}$$

A recurrent class that is not periodic is said to be aperiodic.

Figure 7.10: Structure of a periodic recurrent class. In this example, d = 3.


Thus, in a periodic recurrent class, we move through the sequence of subsets
in order, and after d steps, we end up in the same subset. As an example, the
recurrent class in the second chain of Fig. 7.9 (states 1 and 2) is periodic, and
the same is true of the class consisting of states 4 and 5 in the third chain of
Fig. 7.9. All other recurrent classes in the chains of this figure are aperiodic.

Note that given a periodic recurrent class, a positive time n, and a state
i in the class, there must exist one or more states j for which $r_{ij}(n) = 0$. The
reason is that starting from i, only one of the sets $S_k$ is possible at time n. Thus,
a way to verify aperiodicity of a given recurrent class R, is to check whether
there is a special time $n \ge 1$ and a special state $i \in R$ from which all states in R
can be reached in n steps, i.e., $r_{ij}(n) > 0$ for all $j \in R$. As an example, consider
the first chain in Fig. 7.9. Starting from state 1, every state is possible at time
n = 3, so the unique recurrent class of that chain is aperiodic.

A converse statement, which we do not prove, also turns out to be true: if
a recurrent class R is aperiodic, then there exists a time n such that $r_{ij}(n) > 0$
for every i and j in R; see the end-of-chapter problems.


Periodicity 
Consider a recurrent class R. 


• The class is called periodic if its states can be grouped in d > 1
disjoint subsets $S_1, \ldots, S_d$, so that all transitions from $S_k$ lead to $S_{k+1}$
(or to $S_1$ if k = d).


• The class is aperiodic (not periodic) if and only if there exists a time
n such that $r_{ij}(n) > 0$, for all $i, j \in R$.
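The second criterion in the box lends itself to a mechanical check. The sketch below (with the hypothetical name `is_aperiodic`) assumes that all states of the given matrix form a single recurrent class, and it tracks only which n-step probabilities are positive. To know when to stop searching, it relies on a standard result that is not part of this text (Wielandt's bound): if a suitable n exists for an m-state class, then one exists with n at most (m - 1)^2 + 1.

```python
def is_aperiodic(P):
    """Check whether some n gives r_ij(n) > 0 for all i, j.

    Assumes that all states of P belong to a single recurrent class.
    """
    m = len(P)
    # Only the positivity pattern matters, not the numerical values.
    step = [[P[i][j] > 0 for j in range(m)] for i in range(m)]
    reach = [row[:] for row in step]  # positivity pattern of r_ij(1)
    # Assumed stopping rule (Wielandt's bound): n <= (m - 1)^2 + 1 suffices.
    for _ in range((m - 1) ** 2 + 1):
        if all(all(row) for row in reach):
            return True
        # Positivity pattern of r_ij(n + 1) from that of r_ij(n).
        reach = [[any(reach[i][k] and step[k][j] for k in range(m))
                  for j in range(m)] for i in range(m)]
    return False

two_state = [[0.8, 0.2], [0.6, 0.4]]   # every entry positive already at n = 1
flip = [[0.0, 1.0], [1.0, 0.0]]        # alternates between the two states
```

Here `is_aperiodic(two_state)` holds at n = 1, while `flip` is periodic with d = 2, so no n makes all of its n-step probabilities positive.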


7.3 STEADY-STATE BEHAVIOR


In Markov chain models, we are often interested in long-term state occupancy
behavior, that is, in the n-step transition probabilities $r_{ij}(n)$ when n is very
large. We have seen in the example of Fig. 7.6 that the $r_{ij}(n)$ may converge
to steady-state values that are independent of the initial state. We wish to
understand the extent to which this behavior is typical.

If there are two or more classes of recurrent states, it is clear that the
limiting values of the $r_{ij}(n)$ must depend on the initial state (the possibility of
visiting j far into the future depends on whether j is in the same class as the
initial state i). We will, therefore, restrict attention to chains involving a single
recurrent class, plus possibly some transient states. This is not as restrictive
as it may seem, since we know that once the state enters a particular recurrent
class, it will stay within that class. Thus, the asymptotic behavior of a multiclass
chain can be understood in terms of the asymptotic behavior of a single-class
chain.

Even for chains with a single recurrent class, the $r_{ij}(n)$ may fail to converge.
To see this, consider a recurrent class with two states, 1 and 2, such that from
state 1 we can only go to 2, and from 2 we can only go to 1 ($p_{12} = p_{21} = 1$). Then,
starting at some state, we will be in that same state after any even number of
transitions, and in the other state after any odd number of transitions. Formally,

$$r_{ii}(n) = \begin{cases} 1, & n \text{ even}, \\ 0, & n \text{ odd}. \end{cases}$$

What is happening here is that the recurrent class is periodic, and for such a
class, it can be seen that the $r_{ij}(n)$ generically oscillate.

We now assert that for every state j, the probability $r_{ij}(n)$ of being at state
j approaches a limiting value that is independent of the initial state i, provided
we exclude the two situations discussed above (multiple recurrent classes and/or
a periodic class). This limiting value, denoted by $\pi_j$, has the interpretation

$$\pi_j \approx P(X_n = j), \qquad \text{when } n \text{ is large},$$


and is called the steady-state probability of j. The following is an important 
theorem. Its proof is quite complicated and is outlined together with several 
other proofs in the end-of-chapter problems. 





Steady-State Convergence Theorem 


Consider a Markov chain with a single recurrent class, which is aperiodic.
Then, the states j are associated with steady-state probabilities $\pi_j$ that
have the following properties.

(a) For each j, we have

$$\lim_{n \to \infty} r_{ij}(n) = \pi_j, \qquad \text{for all } i.$$

(b) The $\pi_j$ are the unique solution to the system of equations below:

$$\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}, \qquad j = 1, \ldots, m,$$

$$1 = \sum_{k=1}^{m} \pi_k.$$

(c) We have

$$\pi_j = 0, \qquad \text{for all transient states } j,$$

$$\pi_j > 0, \qquad \text{for all recurrent states } j.$$








The steady-state probabilities $\pi_j$ sum to 1 and form a probability
distribution on the state space, called the stationary distribution of the chain.
The reason for the qualification "stationary" is that if the initial state is chosen
according to this distribution, i.e., if

$$P(X_0 = j) = \pi_j, \qquad j = 1, \ldots, m,$$

then, using the total probability theorem, we have

$$P(X_1 = j) = \sum_{k=1}^{m} P(X_0 = k) p_{kj} = \sum_{k=1}^{m} \pi_k p_{kj} = \pi_j,$$

where the last equality follows from part (b) of the steady-state convergence
theorem. Similarly, we obtain $P(X_n = j) = \pi_j$, for all n and j. Thus, if the
initial state is chosen according to the stationary distribution, the state at any
future time will have the same distribution.

The equations

$$\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}, \qquad j = 1, \ldots, m,$$

are called the balance equations. They are a simple consequence of part (a)
of the theorem and the Chapman-Kolmogorov equation. Indeed, once the
convergence of $r_{ij}(n)$ to some $\pi_j$ is taken for granted, we can consider the equation,

$$r_{ij}(n) = \sum_{k=1}^{m} r_{ik}(n-1) p_{kj},$$

take the limit of both sides as $n \to \infty$, and recover the balance equations.†
Together with the normalization equation

$$\sum_{k=1}^{m} \pi_k = 1,$$

the balance equations can be solved to obtain the $\pi_j$. The following examples
illustrate the solution process.


Example 7.5. Consider a two-state Markov chain with transition probabilities

$$p_{11} = 0.8, \qquad p_{12} = 0.2,$$
$$p_{21} = 0.6, \qquad p_{22} = 0.4.$$

(This is the same as the chain of Example 7.1 and Fig. 7.1.) The balance equations
take the form

$$\pi_1 = \pi_1 p_{11} + \pi_2 p_{21}, \qquad \pi_2 = \pi_1 p_{12} + \pi_2 p_{22},$$

or

$$\pi_1 = 0.8 \cdot \pi_1 + 0.6 \cdot \pi_2, \qquad \pi_2 = 0.2 \cdot \pi_1 + 0.4 \cdot \pi_2.$$

Note that the above two equations are dependent, since they are both equivalent
to

$$\pi_1 = 3\pi_2.$$

This is a generic property, and in fact it can be shown that any one of the balance
equations can always be derived from the remaining equations. However, we also
know that the $\pi_j$ satisfy the normalization equation

$$\pi_1 + \pi_2 = 1,$$

which supplements the balance equations and suffices to determine the $\pi_j$ uniquely.
Indeed, by substituting the equation $\pi_1 = 3\pi_2$ into the equation $\pi_1 + \pi_2 = 1$, we
obtain $3\pi_2 + \pi_2 = 1$, or

$$\pi_2 = 0.25,$$

which using the equation $\pi_1 + \pi_2 = 1$, yields

$$\pi_1 = 0.75.$$

This is consistent with what we found earlier by iterating the Chapman-Kolmogorov
equation (cf. Fig. 7.6).
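For chains with more than a few states, the same solution process can be carried out numerically. The sketch below is one way to do so under the assumptions of the steady-state convergence theorem (a single recurrent class that is aperiodic): it writes down the balance equations, replaces one of them (which is redundant) by the normalization equation, and solves the resulting linear system by Gaussian elimination. The function names `solve` and `steady_state` are ours.

```python
def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def steady_state(P):
    """Steady-state probabilities of a chain with transition matrix P."""
    m = len(P)
    # Balance equation for state j: sum over k of pi_k p_kj, minus pi_j, equals 0.
    A = [[P[k][j] - (1.0 if k == j else 0.0) for k in range(m)]
         for j in range(m)]
    b = [0.0] * m
    # Replace the last (redundant) balance equation by sum of pi_k = 1.
    A[-1] = [1.0] * m
    b[-1] = 1.0
    return solve(A, b)

# Chain of Example 7.5 (states indexed 0 and 1 here).
pi = steady_state([[0.8, 0.2],
                   [0.6, 0.4]])
```

For the chain of Example 7.5 this reproduces the values 0.75 and 0.25 obtained by hand, up to floating-point rounding.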


† According to a famous and important theorem from linear algebra (called the
Perron-Frobenius theorem), the balance equations always have a nonnegative solution,
for any Markov chain. What is special about a chain that has a single recurrent class,
which is aperiodic, is that given also the normalization equation, the solution is unique
and is equal to the limit of the n-step transition probabilities $r_{ij}(n)$.

Example 7.6. An absent-minded professor has two umbrellas that she uses
when commuting from home to office and back. If it rains and an umbrella is
available in her location, she takes it. If it is not raining, she always forgets to take
an umbrella. Suppose that it rains with probability p each time she commutes,
independent of other times. What is the steady-state probability that she gets wet
during a commute?

We model this problem using a Markov chain with the following states:

State i: i umbrellas are available in her current location, i = 0, 1, 2.

The transition probability graph is given in Fig. 7.11, and the transition probability
matrix is

$$\begin{bmatrix} 0 & 0 & 1 \\ 0 & 1-p & p \\ 1-p & p & 0 \end{bmatrix}.$$

Figure 7.11: Transition probability graph for Example 7.6.


The chain has a single recurrent class that is aperiodic (assuming 0 < p < 1),
so the steady-state convergence theorem applies. The balance equations are

$$\pi_0 = (1-p)\pi_2, \qquad \pi_1 = (1-p)\pi_1 + p\pi_2, \qquad \pi_2 = \pi_0 + p\pi_1.$$

From the second equation, we obtain $\pi_1 = \pi_2$, which together with the first equation
$\pi_0 = (1-p)\pi_2$ and the normalization equation $\pi_0 + \pi_1 + \pi_2 = 1$, yields

$$\pi_0 = \frac{1-p}{3-p}, \qquad \pi_1 = \frac{1}{3-p}, \qquad \pi_2 = \frac{1}{3-p}.$$

According to the steady-state convergence theorem, the steady-state probability
that the professor finds herself in a place without an umbrella is $\pi_0$. The
steady-state probability that she gets wet is $\pi_0$ times the probability of rain p.
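As a check on the algebra, the closed-form solution can be substituted back into the balance equations using exact rational arithmetic. In the sketch below, the function names and the rain probability p = 1/4 are hypothetical choices made only for illustration.

```python
from fractions import Fraction

def umbrella_chain(p):
    """Transition matrix of Example 7.6; state i = umbrellas at current location."""
    return [[0, 0, 1],
            [0, 1 - p, p],
            [1 - p, p, 0]]

def closed_form(p):
    """The steady-state probabilities (pi_0, pi_1, pi_2) derived in the example."""
    return [(1 - p) / (3 - p), 1 / (3 - p), 1 / (3 - p)]

def satisfies_balance(P, pi):
    """True if pi_j equals the sum over k of pi_k p_kj, for every j."""
    m = len(P)
    return all(pi[j] == sum(pi[k] * P[k][j] for k in range(m))
               for j in range(m))

p = Fraction(1, 4)        # hypothetical probability of rain
pi = closed_form(p)
wet = pi[0] * p           # no umbrella at her location, and it rains
```

With exact fractions, the balance equations and the normalization equation hold with equality rather than only up to rounding.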


Example 7.7. A superstitious professor works in a circular building with m doors,
where m is odd, and never uses the same door twice in a row. Instead he uses
with probability p (or probability $1 - p$) the door that is adjacent in the clockwise
direction (or the counterclockwise direction, respectively) to the door he used last.
What is the probability that a given door will be used on some particular day far
into the future?

Figure 7.12: Transition probability graph in Example 7.7, for the case of
m = 5 doors. Assuming that 0 < p < 1, it is not hard to see that given an
initial state, every state j can be reached in exactly 5 steps, and therefore the
chain is aperiodic.

We introduce a Markov chain with the following m states:

State i: last door used is door i, i = 1, ..., m.

The transition probability graph of the chain is given in Fig. 7.12, for the case
m = 5. The transition probability matrix is

$$\begin{bmatrix}
0 & p & 0 & 0 & \cdots & 0 & 1-p \\
1-p & 0 & p & 0 & \cdots & 0 & 0 \\
0 & 1-p & 0 & p & \cdots & 0 & 0 \\
\vdots & \vdots & \vdots & \vdots & & \vdots & \vdots \\
p & 0 & 0 & 0 & \cdots & 1-p & 0
\end{bmatrix}.$$

Assuming that 0 < p < 1, the chain has a single recurrent class that is aperiodic.
(To verify aperiodicity, we leave it to the reader to verify that given an initial state,
every state j can be reached in exactly m steps, and the criterion for aperiodicity
given at the end of the preceding section is satisfied.) The balance equations are

$$\pi_1 = (1-p)\pi_2 + p\pi_m,$$
$$\pi_i = p\pi_{i-1} + (1-p)\pi_{i+1}, \qquad i = 2, \ldots, m-1,$$
$$\pi_m = (1-p)\pi_1 + p\pi_{m-1}.$$

These equations are easily solved once we observe that by symmetry, all doors
should have the same steady-state probability. This suggests the solution

$$\pi_j = \frac{1}{m}, \qquad j = 1, \ldots, m.$$

Indeed, we see that these $\pi_j$ satisfy the balance equations as well as the
normalization equation, so they must be the desired steady-state probabilities (by the
uniqueness part of the steady-state convergence theorem).

Note that if either p = 0 or p = 1, the chain still has a single recurrent
class but is periodic. In this case, the n-step transition probabilities $r_{ij}(n)$ do not
converge to a limit, because the doors are used in a cyclic order. Similarly, if m is
even, the recurrent class of the chain is periodic, since the states can be grouped
into two subsets, the even and the odd numbered states, such that from each subset
one can only go to the other subset.
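Both claims of this example can be checked for specific values. The sketch below assumes m is at least 3 and uses 0-based door indices; the function names and the value p = 1/3 are hypothetical. It verifies that the uniform distribution satisfies the balance equations, and that every state can be reached in exactly m steps when m = 5 but not when m = 4.

```python
from fractions import Fraction

def door_chain(m, p):
    """Transition matrix for m >= 3 doors arranged on a circle."""
    P = [[0] * m for _ in range(m)]
    for i in range(m):
        P[i][(i + 1) % m] = p        # adjacent door, clockwise
        P[i][(i - 1) % m] = 1 - p    # adjacent door, counterclockwise
    return P

def uniform_is_stationary(P):
    """True if pi_j = 1/m satisfies every balance equation."""
    m = len(P)
    pi = [Fraction(1, m)] * m
    return all(pi[j] == sum(pi[k] * P[k][j] for k in range(m))
               for j in range(m))

def all_reachable_in(P, n):
    """True if r_ij(n) > 0 for every pair of states i, j."""
    m = len(P)
    reach = [[P[i][j] > 0 for j in range(m)] for i in range(m)]
    for _ in range(n - 1):
        reach = [[any(reach[i][k] and P[k][j] > 0 for k in range(m))
                  for j in range(m)] for i in range(m)]
    return all(all(row) for row in reach)

p = Fraction(1, 3)   # hypothetical value with 0 < p < 1
odd_doors = door_chain(5, p)
even_doors = door_chain(4, p)
```

Note that the uniform distribution satisfies the balance equations for even m as well; what fails in that case is aperiodicity, and with it the convergence of the n-step transition probabilities.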


Long-Term Frequency Interpretations 


Probabilities are often interpreted as relative frequencies in an infinitely long 
string of independent trials. The steady-state probabilities of a Markov chain 
admit a similar interpretation, despite the absence of independence. 

Consider, for example, a Markov chain involving a machine, which at the 
end of any day can be in one of two states, working or broken down. Each time 
it breaks down, it is immediately repaired at a cost of $1. How are we to model 
the long-term expected cost of repair per day? One possibility is to view it as the 
expected value of the repair cost on a randomly chosen day far into the future; 
this is just the steady-state probability of the broken down state. Alternatively, 
we can calculate the total expected repair cost in n days, where n is very large, 
and divide it by n. Intuition suggests that these two methods of calculation 
should give the same result. Theory supports this intuition, and in general we 
have the following interpretation of steady-state probabilities (a justification is 
given in the end-of-chapter problems). 


Steady-State Probabilities as Expected State Frequencies 


For a Markov chain with a single class which is aperiodic, the steady-state
probabilities $\pi_j$ satisfy

$$\pi_j = \lim_{n \to \infty} \frac{v_{ij}(n)}{n},$$

where $v_{ij}(n)$ is the expected value of the number of visits to state j within
the first n transitions, starting from state i.
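The expected number of visits can be computed exactly, without simulation, because by linearity of expectation it is the sum of the t-step transition probabilities into state j for t = 1, ..., n. The sketch below (the name `expected_visits` is ours) uses this fact for the two-state chain of Example 7.5, with states indexed 0 and 1.

```python
def expected_visits(P, i, n):
    """Expected visits to each state within the first n transitions, from state i."""
    m = len(P)
    dist = [1.0 if k == i else 0.0 for k in range(m)]  # law of X_0
    visits = [0.0] * m
    for _ in range(n):
        # After one more transition, dist holds the law of X_t given X_0 = i.
        dist = [sum(dist[k] * P[k][j] for k in range(m)) for j in range(m)]
        # The expected indicator of "X_t = j" is the probability dist[j].
        visits = [v + d for v, d in zip(visits, dist)]
    return visits

P = [[0.8, 0.2],
     [0.6, 0.4]]
n = 10000
freq_from_0 = [v / n for v in expected_visits(P, 0, n)]
freq_from_1 = [v / n for v in expected_visits(P, 1, n)]
```

Both starting states give frequencies close to the steady-state probabilities 0.75 and 0.25, with a discrepancy that shrinks as n grows.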





Based on this interpretation, $\pi_j$ is the long-term expected fraction of time
that the state is equal to j. Each time that state j is visited, there is probability
$p_{jk}$ that the next transition takes us to state k. We conclude that $\pi_j p_{jk}$ can
be viewed as the long-term expected fraction of transitions that move the state
from j to k.†


Expected Frequency of a Particular Transition


Consider n transitions of a Markov chain with a single class which is
aperiodic, starting from a given initial state. Let $q_{jk}(n)$ be the expected number
of such transitions that take the state from j to k. Then, regardless of the
initial state, we have

$$\lim_{n \to \infty} \frac{q_{jk}(n)}{n} = \pi_j p_{jk}.$$





Given the frequency interpretation of $\pi_j$ and $\pi_k p_{kj}$, the balance equation

$$\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}$$

has an intuitive meaning. It expresses the fact that the expected frequency $\pi_j$
of visits to j is equal to the sum of the expected frequencies $\pi_k p_{kj}$ of transitions
that lead to j; see Fig. 7.13.

Figure 7.13: Interpretation of the balance equations in terms of frequencies. In
a very large number of transitions, we expect a fraction $\pi_k p_{kj}$ that bring the state
from k to j. (This also applies to transitions from j to itself, which occur with
frequency $\pi_j p_{jj}$.) The sum of the expected frequencies of such transitions is the
expected frequency $\pi_j$ of being at state j.


† In fact, some stronger statements are also true, such as the following. Whenever
we carry out a probabilistic experiment and generate a trajectory of the Markov chain
over an infinite time horizon, the observed long-term frequency with which state j is
visited will be exactly equal to $\pi_j$, and the observed long-term frequency of transitions
from j to k will be exactly equal to $\pi_j p_{jk}$. Even though the trajectory is random, these
equalities hold with essential certainty, that is, with probability 1.

Birth-Death Processes


A birth-death process is a Markov chain in which the states are linearly
arranged and transitions can only occur to a neighboring state, or else leave the
state unchanged. They arise in many contexts, especially in queueing theory.
Figure 7.14 shows the general structure of a birth-death process and also
introduces some generic notation for the transition probabilities. In particular,

$$b_i = P(X_{n+1} = i+1 \mid X_n = i), \qquad \text{("birth" probability at state } i),$$
$$d_i = P(X_{n+1} = i-1 \mid X_n = i), \qquad \text{("death" probability at state } i).$$





Figure 7.14: Transition probability graph for a birth-death process. 


For a birth-death process, the balance equations can be substantially
simplified. Let us focus on two neighboring states, say, i and i + 1. In any trajectory
of the Markov chain, a transition from i to i + 1 has to be followed by a transition
from i + 1 to i, before another transition from i to i + 1 can occur. Therefore,
the expected frequency of transitions from i to i + 1, which is $\pi_i b_i$, must be equal
to the expected frequency of transitions from i + 1 to i, which is $\pi_{i+1} d_{i+1}$. This
leads to the local balance equations†

$$\pi_i b_i = \pi_{i+1} d_{i+1}, \qquad i = 0, 1, \ldots, m-1.$$

Using the local balance equations, we obtain

$$\pi_i = \pi_0 \frac{b_0 b_1 \cdots b_{i-1}}{d_1 d_2 \cdots d_i}, \qquad i = 1, \ldots, m,$$

from which, using also the normalization equation $\sum_i \pi_i = 1$, the steady-state
probabilities $\pi_i$ are easily computed.
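The computation just described takes only a few lines. In the sketch below, the name `birth_death_steady_state` and the numerical values are hypothetical; the list `b` holds the birth probabilities at states 0 through m - 1, and the list `d` holds the death probabilities at states 1 through m.

```python
def birth_death_steady_state(b, d):
    """Steady-state probabilities for states 0..m of a birth-death process.

    b[i] is the birth probability at state i (i = 0..m-1);
    d[i] is the death probability at state i + 1 (i = 0..m-1).
    """
    weights = [1.0]  # the ratio pi_i / pi_0, starting with i = 0
    for bi, di in zip(b, d):
        # Local balance: pi_{i+1} = pi_i * b_i / d_{i+1}
        weights.append(weights[-1] * bi / di)
    total = sum(weights)  # normalization: the pi_i must sum to 1
    return [w / total for w in weights]

# Hypothetical process with m = 3, constant birth and death probabilities.
b = [0.2, 0.2, 0.2]
d = [0.4, 0.4, 0.4]
pi = birth_death_steady_state(b, d)
```

Working with the ratios to the first state and normalizing at the end avoids solving any simultaneous equations.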


† A more formal derivation that does not rely on the frequency interpretation
proceeds as follows. The balance equation at state 0 is $\pi_0(1 - b_0) + \pi_1 d_1 = \pi_0$, which
yields the first local balance equation $\pi_0 b_0 = \pi_1 d_1$.

The balance equation at state 1 is $\pi_0 b_0 + \pi_1(1 - b_1 - d_1) + \pi_2 d_2 = \pi_1$. Using
the local balance equation $\pi_0 b_0 = \pi_1 d_1$ at the previous state, this is rewritten as
$\pi_1 d_1 + \pi_1(1 - b_1 - d_1) + \pi_2 d_2 = \pi_1$, which simplifies to $\pi_1 b_1 = \pi_2 d_2$. We can then
continue similarly to obtain the local balance equations at all other states.

Example 7.8. Random Walk with Reflecting Barriers. A person walks along
a straight line and, at each time period, takes a step to the right with probability
b, and a step to the left with probability $1 - b$. The person starts in one of the
positions 1, 2, ..., m, but if he reaches position 0 (or position m + 1), his step is
instantly reflected back to position 1 (or position m, respectively). Equivalently, we
may assume that when the person is in positions 1 or m, he will stay in that position
with corresponding probability $1 - b$ and b, respectively. We introduce a Markov
chain model whose states are the positions 1, ..., m. The transition probability
graph of the chain is given in Fig. 7.15.




Figure 7.15: Transition probability graph for the random walk Example 7.8. 


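The local balance equations are

$$\pi_i b = \pi_{i+1}(1-b), \qquad i = 1, \ldots, m-1.$$

Thus, $\pi_{i+1} = \rho\pi_i$, where

$$\rho = \frac{b}{1-b},$$

and we can express all the $\pi_i$ in terms of $\pi_1$, as

$$\pi_i = \rho^{i-1}\pi_1, \qquad i = 1, \ldots, m.$$

Using the normalization equation $1 = \pi_1 + \cdots + \pi_m$, we obtain

$$1 = \pi_1(1 + \rho + \cdots + \rho^{m-1}),$$

which leads to

$$\pi_i = \frac{\rho^{i-1}}{1 + \rho + \cdots + \rho^{m-1}}, \qquad i = 1, \ldots, m.$$

Note that if $\rho = 1$ (left and right steps are equally likely), then $\pi_i = 1/m$ for all i.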


Example 7.9. Queueing. Packets arrive at a node of a communication network, 
where they are stored in a buffer and then transmitted. The storage capacity of 
the buffer is m: if m packets are already present, any newly arriving packets are 
discarded. We discretize time in very small periods, and we assume that in each
period, at most one event can happen that can change the number of packets stored
in the node (an arrival of a new packet or a completion of the transmission of an
existing packet). In particular, we assume that at each period, exactly one of the
following occurs:


(a) one new packet arrives; this happens with a given probability b > 0; 


(b) one existing packet completes transmission; this happens with a given prob- 
ability d > 0 if there is at least one packet in the node, and with probability 
0 otherwise; 


(c) no new packet arrives and no existing packet completes transmission; this 
happens with probability 1 — b — d if there is at least one packet in the node, 
and with probability 1 — b otherwise. 


We introduce a Markov chain with states 0,1,...,m, corresponding to the 
number of packets in the buffer. The transition probability graph is given in 
Fig. 7.16. 




Figure 7.16: Transition probability graph in Example 7.9. 


The local balance equations are

$$\pi_i b = \pi_{i+1} d, \qquad i = 0, 1, \ldots, m-1.$$

We define

$$\rho = \frac{b}{d},$$

and obtain $\pi_{i+1} = \rho\pi_i$, which leads to

$$\pi_i = \rho^i \pi_0, \qquad i = 0, 1, \ldots, m.$$

By using the normalization equation $1 = \pi_0 + \pi_1 + \cdots + \pi_m$, we obtain

$$1 = \pi_0(1 + \rho + \cdots + \rho^m),$$

and

$$\pi_0 = \begin{cases} \dfrac{1-\rho}{1-\rho^{m+1}}, & \text{if } \rho \neq 1, \\[2ex] \dfrac{1}{m+1}, & \text{if } \rho = 1. \end{cases}$$

Using the equation $\pi_i = \rho^i \pi_0$, the steady-state probabilities are

$$\pi_i = \begin{cases} \dfrac{1-\rho}{1-\rho^{m+1}}\,\rho^i, & \text{if } \rho \neq 1, \\[2ex] \dfrac{1}{m+1}, & \text{if } \rho = 1, \end{cases} \qquad i = 0, 1, \ldots, m.$$

It is interesting to consider what happens when the buffer size m is so large
that it can be considered as practically infinite. We distinguish two cases.


(a) Suppose that b < d, or $\rho < 1$. In this case, arrivals of new packets are
less likely than departures of existing packets. This prevents the number
of packets in the buffer from growing, and the steady-state probabilities $\pi_i$
decrease with i, as in a (truncated) geometric PMF. We observe that as
$m \to \infty$, we have $1 - \rho^{m+1} \to 1$, and

$$\pi_i \to \rho^i(1-\rho), \qquad \text{for all } i.$$

We can view these as the steady-state probabilities in a system with an infinite
buffer. [As a check, note that we have $\sum_{i=0}^{\infty} \rho^i(1-\rho) = 1$.]


(b) Suppose that b > d, or $\rho > 1$. In this case, arrivals of new packets are more
likely than departures of existing packets. The number of packets in the buffer
tends to increase, and the steady-state probabilities $\pi_i$ increase with i. As we
consider larger and larger buffer sizes m, the steady-state probability of any
fixed state i decreases to zero:

$$\pi_i \to 0, \qquad \text{for all } i.$$


Were we to consider a system with an infinite buffer, we would have a Markov 
chain with a countably infinite number of states. Although we do not have 
the machinery to study such chains, the preceding calculation suggests that 
every state will have zero steady-state probability and will be “transient.” The 
number of packets in queue will generally grow to infinity, and any particular 
state will be visited only a finite number of times. 
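The two limiting cases can be observed numerically from the closed-form expression for a finite buffer. In the sketch below, the function name and the specific values of the ratio b/d and of the buffer size are hypothetical choices for illustration.

```python
def queue_steady_state(rho, m):
    """Steady-state probabilities pi_0..pi_m for the queue of Example 7.9.

    rho is the ratio b / d; m is the buffer size.
    """
    if rho == 1:
        return [1.0 / (m + 1)] * (m + 1)
    pi0 = (1 - rho) / (1 - rho ** (m + 1))
    return [pi0 * rho ** i for i in range(m + 1)]

m = 200                               # a "large" buffer, chosen arbitrarily
light = queue_steady_state(0.5, m)    # case (a): b < d
heavy = queue_steady_state(2.0, m)    # case (b): b > d

# Infinite-buffer limit for case (a): rho^i * (1 - rho)
geometric = [0.5 ** i * (1 - 0.5) for i in range(5)]
```

For the light-load case the first few entries of `light` are numerically indistinguishable from `geometric`, while in the heavy-load case the probability of any fixed small state is vanishingly small and almost all of the probability sits near the full buffer.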


The preceding analysis provides a glimpse into the character of Markov chains 
with an infinite number of states. In such chains, even if there is a single and 
aperiodic recurrent class, the chain may never reach steady-state and a steady- 
state distribution may not exist. 


7.4 ABSORPTION PROBABILITIES AND EXPECTED TIME 
TO ABSORPTION 


In this section, we study the short-term behavior of Markov chains. We first 
consider the case where the Markov chain starts at a transient state. We are 
interested in the first recurrent state to be entered, as well as in the time until 
this happens. 

When addressing such questions, the subsequent behavior of the Markov 
chain (after a recurrent state is encountered) is immaterial. We can therefore 
focus on the case where every recurrent state k is absorbing, i.e., 


$$p_{kk} = 1, \qquad p_{kj} = 0 \ \text{ for all } j \neq k.$$


If there is a unique absorbing state k, its steady-state probability is 1 (because
all other states are transient and have zero steady-state probability), and will be
reached with probability 1, starting from any initial state. If there are multiple
absorbing states, the probability that one of them will be eventually reached is
still 1, but the identity of the absorbing state to be entered is random and the
associated probabilities may depend on the starting state. In the sequel, we fix a
particular absorbing state, denoted by s, and consider the absorption probability
$a_i$ that s is eventually reached, starting from i:

$$a_i = P(X_n \text{ eventually becomes equal to the absorbing state } s \mid X_0 = i).$$


Absorption probabilities can be obtained by solving a system of linear equations, 
as indicated below. 


Absorption Probability Equations 


Consider a Markov chain where each state is either transient or absorbing,
and fix a particular absorbing state $s$. Then, the probabilities $a_i$ of
eventually reaching state $s$, starting from $i$, are the unique solution to the
equations


$$a_s = 1,$$
$$a_i = 0, \qquad \text{for all absorbing } i \neq s,$$
$$a_i = \sum_{j=1}^{m} p_{ij} a_j, \qquad \text{for all transient } i.$$





The equations $a_s = 1$, and $a_i = 0$, for all absorbing $i \neq s$, are evident
from the definitions. To verify the remaining equations, we argue as follows. Let
us consider a transient state $i$ and let $A$ be the event that state $s$ is eventually
reached. We have


$$\begin{aligned}
a_i &= P(A \mid X_0 = i) \\
&= \sum_{j=1}^{m} P(A \mid X_0 = i, X_1 = j)\, P(X_1 = j \mid X_0 = i) && \text{(total probability thm.)} \\
&= \sum_{j=1}^{m} P(A \mid X_1 = j)\, p_{ij} && \text{(Markov property)} \\
&= \sum_{j=1}^{m} a_j p_{ij}.
\end{aligned}$$


The uniqueness property of the solution to the absorption probability equations 
requires a separate argument, which is given in the end-of-chapter problems. 

The next example illustrates how we can use the preceding method to 
calculate the probability of entering a given recurrent class (rather than a given 
absorbing state). 




Example 7.10. Consider the Markov chain shown in Fig. 7.17(a). Note that
there are two recurrent classes, namely {1} and {4,5}. We would like to calculate
the probability that the state eventually enters the recurrent class {4,5} starting
from one of the transient states. For the purposes of this problem, the possible
transitions within the recurrent class {4,5} are immaterial. We can therefore lump
the states in this recurrent class and treat them as a single absorbing state (call it
state 6), as in Fig. 7.17(b). It then suffices to compute the probability of eventually
entering state 6 in this new chain.





Figure 7.17: (a) Transition probability graph in Example 7.10. (b) A new 
graph in which states 4 and 5 have been lumped into the absorbing state 6. 


The probabilities of eventually reaching state 6, starting from the transient 
states 2 and 3, satisfy the following equations: 


$$a_2 = 0.2a_1 + 0.3a_2 + 0.4a_3 + 0.1a_6,$$
$$a_3 = 0.2a_2 + 0.8a_6.$$


Using the facts $a_1 = 0$ and $a_6 = 1$, we obtain


$$a_2 = 0.3a_2 + 0.4a_3 + 0.1,$$
$$a_3 = 0.2a_2 + 0.8.$$


This is a system of two equations in the two unknowns $a_2$ and $a_3$, which can be
readily solved to yield $a_2 = 21/31$ and $a_3 = 29/31$.
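
As a check on the arithmetic of Example 7.10, the two equations can be solved exactly with rational arithmetic. The following is only a minimal sketch: the helper `solve_linear` is a hypothetical name introduced here (a plain Gauss-Jordan elimination), not something taken from the text.

```python
from fractions import Fraction as F

def solve_linear(A, b):
    """Solve A x = b exactly by Gauss-Jordan elimination over Fractions."""
    n = len(A)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[F(v) for v in row] + [F(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row with a nonzero pivot and move it into place.
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        # Eliminate the pivot column from every other row.
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [row[-1] for row in M]

# Unknowns (a2, a3), after moving all unknowns to the left-hand side:
#    0.7 a2 - 0.4 a3 = 0.1
#   -0.2 a2 + 1.0 a3 = 0.8
A = [[F(7, 10), F(-4, 10)],
     [F(-2, 10), F(1)]]
b = [F(1, 10), F(8, 10)]

a2, a3 = solve_linear(A, b)
print(a2, a3)  # 21/31 29/31
```

Exact fractions are used so that the result can be compared directly with the values 21/31 and 29/31 stated in the example.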


Example 7.11. Gambler’s Ruin. A gambler wins $1 at each round, with
probability p, and loses $1, with probability 1 − p. Different rounds are assumed
independent. The gambler plays continuously until he either accumulates a target
amount of $m, or loses all his money. What is the probability of eventually
accumulating the target amount (winning) or of losing his fortune?

We introduce the Markov chain shown in Fig. 7.18 whose state i represents 
the gambler's wealth at the beginning of a round. The states i = 0 and i = m 
correspond to losing and winning, respectively. 

All states are transient, except for the winning and losing states which are 
absorbing. Thus, the problem amounts to finding the probabilities of absorption 
at each one of these two absorbing states. Of course, these absorption probabilities 
depend on the initial state i. 


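[Figure: states 0, 1, ..., 4 arranged in a line; from each state 1, 2, 3 the chain moves up by one with probability p and down by one with probability 1 − p; states 0 (lose) and 4 (win) are absorbing.]

Figure 7.18: Transition probability graph for the gambler’s ruin problem
(Example 7.11). Here m = 4.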


Let us set $s = m$, in which case the absorption probability $a_i$ is the probability
of winning, starting from state $i$. These probabilities satisfy


$$a_0 = 0,$$
$$a_i = (1 - p)a_{i-1} + p\,a_{i+1}, \qquad i = 1, \ldots, m-1,$$
$$a_m = 1.$$


These equations can be solved in a variety of ways. It turns out there is an elegant 
method that leads to a nice closed form solution. 
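Let us write the equations for the $a_i$ as


$$(1 - p)(a_i - a_{i-1}) = p(a_{i+1} - a_i), \qquad i = 1, \ldots, m-1.$$


Then, by denoting

$$\delta_i = a_{i+1} - a_i, \qquad i = 0, \ldots, m-1,$$

and

$$\rho = \frac{1-p}{p},$$

the equations are written as

$$\delta_i = \rho\,\delta_{i-1}, \qquad i = 1, \ldots, m-1,$$

from which we obtain

$$\delta_i = \rho^i \delta_0, \qquad i = 1, \ldots, m-1.$$

This, together with the equation $\delta_0 + \delta_1 + \cdots + \delta_{m-1} = a_m - a_0 = 1$, implies that

$$(1 + \rho + \cdots + \rho^{m-1})\,\delta_0 = 1,$$

and

$$\delta_0 = \frac{1}{1 + \rho + \cdots + \rho^{m-1}}.$$

Since $a_0 = 0$ and $a_{i+1} = a_i + \delta_i$, the probability $a_i$ of winning starting from a
fortune $i$ is equal to

$$\begin{aligned}
a_i &= \delta_0 + \delta_1 + \cdots + \delta_{i-1} \\
&= (1 + \rho + \cdots + \rho^{i-1})\,\delta_0 \\
&= \frac{1 + \rho + \cdots + \rho^{i-1}}{1 + \rho + \cdots + \rho^{m-1}},
\end{aligned}$$

which simplifies to

$$a_i = \begin{cases} \dfrac{1 - \rho^i}{1 - \rho^m}, & \text{if } \rho \neq 1, \\[2mm] \dfrac{i}{m}, & \text{if } \rho = 1. \end{cases}$$

The solution reveals that if $\rho > 1$, which corresponds to $p < 1/2$ and unfavorable
odds for the gambler, the probability of winning approaches 0 as $m \to \infty$,
for any fixed initial fortune. This suggests that if you aim for a large profit under
unfavorable odds, financial ruin is almost certain.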
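
The closed-form solution can be checked against the absorption probability equations it was derived from. The sketch below assumes illustrative values m = 4 and p = 2/5 (these numbers are not from the text), and the function name `win_probability` is hypothetical.

```python
from fractions import Fraction

def win_probability(i, m, p):
    """Probability of reaching m before 0, starting from fortune i."""
    p = Fraction(p)
    rho = (1 - p) / p
    if rho == 1:
        return Fraction(i, m)
    return (1 - rho**i) / (1 - rho**m)

# Illustrative values (assumed): target m = 4, unfavorable odds p = 2/5.
m, p = 4, Fraction(2, 5)
a = [win_probability(i, m, p) for i in range(m + 1)]

# Boundary conditions and the equations a_i = (1 - p) a_{i-1} + p a_{i+1}.
assert a[0] == 0 and a[m] == 1
for i in range(1, m):
    assert a[i] == (1 - p) * a[i - 1] + p * a[i + 1]

print([str(x) for x in a])

# With unfavorable odds, a large target makes winning very unlikely.
print(float(win_probability(10, 100, p)))
```

Rational arithmetic makes the equation check exact rather than approximate; with floating-point values the comparisons would need a tolerance.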


Expected Time to Absorption 


We now turn our attention to the expected number of steps until a recurrent 
state is entered (an event that we refer to as “absorption”), starting from a 
particular transient state. For any state i, we denote 


$$\begin{aligned}
\mu_i &= E[\text{number of transitions until absorption, starting from } i] \\
&= E\big[\min\{n \ge 0 \mid X_n \text{ is recurrent}\} \mid X_0 = i\big].
\end{aligned}$$


Note that if $i$ is recurrent, then $\mu_i = 0$ according to this definition.

We can derive equations for the $\mu_i$ by using the total expectation theorem.
We argue that the time to absorption starting from a transient state $i$ is equal to
1 plus the expected time to absorption starting from the next state, which is $j$
with probability $p_{ij}$. We then obtain a system of linear equations, stated below,
which is known to have a unique solution (see Problem 33 for the main idea).


Equations for the Expected Time to Absorption


The expected times to absorption, $\mu_1, \ldots, \mu_m$, are the unique solution to
the equations


$$\mu_i = 0, \qquad \text{for all recurrent states } i,$$

$$\mu_i = 1 + \sum_{j=1}^{m} p_{ij}\mu_j, \qquad \text{for all transient states } i.$$








Example 7.12. Spiders and Fly. Consider the spiders-and-fly model of Ex- 
ample 7.2. This corresponds to the Markov chain shown in Fig. 7.19. The states 
correspond to possible fly positions, and the absorbing states 1 and m correspond 
to capture by a spider. 

Let us calculate the expected number of steps until the fly is captured. We 
have 

$$\mu_1 = \mu_m = 0,$$

and

$$\mu_i = 1 + 0.3\mu_{i-1} + 0.4\mu_i + 0.3\mu_{i+1}, \qquad \text{for } i = 2, \ldots, m-1.$$


We can solve these equations in a variety of ways, for example by
successive substitution. As an illustration, let $m = 4$, in which case the equations
reduce to


$$\mu_2 = 1 + 0.4\mu_2 + 0.3\mu_3, \qquad \mu_3 = 1 + 0.3\mu_2 + 0.4\mu_3.$$


The first equation yields $\mu_2 = (1/0.6) + (1/2)\mu_3$, which we can substitute in the
second equation and solve for $\mu_3$. We obtain $\mu_3 = 10/3$ and, by substitution again,
$\mu_2 = 10/3$.


Figure 7.19: Transition probability graph in Example 7.12.
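
The equations for the expected time to absorption can also be solved numerically. The sketch below uses repeated substitution of the current estimates into the right-hand side (a fixed-point iteration), which is a choice made here for brevity rather than the method used in Example 7.12; the function name and the dictionary representation of the chain are assumptions.

```python
def expected_absorption_times(P, transient, iterations=500):
    """Iterate mu_i <- 1 + sum_j p_ij mu_j over the transient states.

    P maps each state to a dict {next_state: probability}. States that are
    not listed in `transient` are treated as recurrent and keep mu = 0.
    """
    mu = {state: 0.0 for state in P}
    for _ in range(iterations):
        new_mu = dict(mu)
        for i in transient:
            new_mu[i] = 1.0 + sum(p * mu[j] for j, p in P[i].items())
        mu = new_mu
    return mu

# Spiders-and-fly chain of Example 7.12 with m = 4 (states 1 and 4 absorbing).
P = {
    1: {1: 1.0},
    2: {1: 0.3, 2: 0.4, 3: 0.3},
    3: {2: 0.3, 3: 0.4, 4: 0.3},
    4: {4: 1.0},
}
mu = expected_absorption_times(P, transient=[2, 3])
print(mu[2], mu[3])  # both close to 10/3
```

The iteration converges here because the chain leaves the set of transient states with positive probability at every step; for larger chains a direct linear solve may be preferable.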


Mean First Passage and Recurrence Times 


The idea used to calculate the expected time to absorption can also be used
to calculate the expected time to reach a particular recurrent state, starting
from any other state. For simplicity, we consider a Markov chain with a single
recurrent class. We focus on a special recurrent state $s$, and we denote by $t_i$ the
mean first passage time from state $i$ to state $s$, defined by


$$\begin{aligned}
t_i &= E[\text{number of transitions to reach } s \text{ for the first time, starting from } i] \\
&= E\big[\min\{n \ge 0 \mid X_n = s\} \mid X_0 = i\big].
\end{aligned}$$


The transitions out of state s are irrelevant to the calculation of the mean 
first passage times. We may thus consider a new Markov chain which is identical 
to the original, except that the special state s is converted into an absorbing 
state (by setting $p_{ss} = 1$, and $p_{sj} = 0$ for all $j \neq s$). With this transformation,
all states other than $s$ become transient. We then compute $t_i$ as the expected
number of steps to absorption starting from $i$, using the formulas given earlier
in this section. We have


$$t_i = 1 + \sum_{j=1}^{m} p_{ij} t_j, \qquad \text{for all } i \neq s,$$

$$t_s = 0.$$


This system of linear equations can be solved for the unknowns $t_i$, and has a
unique solution (see the end-of-chapter problems).

The above equations give the expected time to reach the special state s 
starting from any other state. We may also want to calculate the mean recur- 
rence time of the special state s, which is defined as 


$$\begin{aligned}
t_s^* &= E[\text{number of transitions up to the first return to } s, \text{ starting from } s] \\
&= E\big[\min\{n \ge 1 \mid X_n = s\} \mid X_0 = s\big].
\end{aligned}$$


We can obtain $t_s^*$, once we have the first passage times $t_i$, by using the equation


$$t_s^* = 1 + \sum_{j=1}^{m} p_{sj} t_j.$$


To justify this equation, we argue that the time to return to $s$, starting from $s$,
is equal to 1 plus the expected time to reach $s$ from the next state, which is $j$
with probability $p_{sj}$. We then apply the total expectation theorem.


Example 7.13. Consider the “up-to-date”–“behind” model of Example 7.1. States
1 and 2 correspond to being up-to-date and being behind, respectively, and the
transition probabilities are


$$p_{11} = 0.8, \qquad p_{12} = 0.2,$$

$$p_{21} = 0.6, \qquad p_{22} = 0.4.$$

Let us focus on state $s = 1$ and calculate the mean first passage time to state 1,
starting from state 2. We have $t_1 = 0$ and


$$t_2 = 1 + p_{21}t_1 + p_{22}t_2 = 1 + 0.4\,t_2,$$


from which

$$t_2 = \frac{1}{0.6} = \frac{5}{3}.$$

The mean recurrence time to state 1 is given by

$$t_1^* = 1 + p_{11}t_1 + p_{12}t_2 = 1 + 0 + 0.2 \cdot \frac{5}{3} = \frac{4}{3}.$$


Equations for Mean First Passage and Recurrence Times 


Consider a Markov chain with a single recurrent class, and let $s$ be a particular
recurrent state.


• The mean first passage times $t_i$ to reach state $s$ starting from $i$, are
the unique solution to the system of equations

$$t_s = 0, \qquad t_i = 1 + \sum_{j=1}^{m} p_{ij} t_j, \quad \text{for all } i \neq s.$$

• The mean recurrence time $t_s^*$ of state $s$ is given by

$$t_s^* = 1 + \sum_{j=1}^{m} p_{sj} t_j.$$
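
The two formulas above translate directly into a short computation. The following sketch approximates the mean first passage times by iterating the first-passage equations and then evaluates the recurrence-time formula; the function names are hypothetical, the iterative solver is an assumption made for brevity, and the numbers are those of Example 7.13.

```python
def mean_first_passage_times(P, s, iterations=1000):
    """Approximate t_i = 1 + sum_j p_ij t_j (i != s), with t_s = 0, by iteration."""
    n = len(P)
    t = [0.0] * n
    for _ in range(iterations):
        t = [0.0 if i == s else 1.0 + sum(P[i][j] * t[j] for j in range(n))
             for i in range(n)]
    return t

def mean_recurrence_time(P, s, t):
    """Evaluate t_s^* = 1 + sum_j p_sj t_j from the first passage times t."""
    return 1.0 + sum(P[s][j] * t[j] for j in range(len(P)))

# Example 7.13, with states 1 and 2 stored at indices 0 and 1.
P = [[0.8, 0.2],
     [0.6, 0.4]]
t = mean_first_passage_times(P, s=0)
t_star = mean_recurrence_time(P, 0, t)
print(t[1], t_star)  # close to 5/3 and 4/3
```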





7.5 CONTINUOUS-TIME MARKOV CHAINS 


In the Markov chain models that we have considered so far, we have assumed
that the transitions between states take unit time. In this section, we consider a
related class of models that evolve in continuous time and can be used to study
systems involving continuous-time arrival processes. Examples are distribution
centers or nodes in communication networks where some events of interest (e.g.,
arrivals of orders or of new calls) are described in terms of Poisson processes.

As before, we will consider a process that involves transitions from one
state to the next, according to given transition probabilities, but we will model
the times spent between transitions as continuous random variables. We will
still assume that the number of states is finite and, in the absence of a statement
to the contrary, we will let the state space be the set $S = \{1, \ldots, m\}$.
To describe the process, we introduce certain random variables of interest:


$X_n$: the state right after the nth transition;
$Y_n$: the time of the nth transition;
$T_n$: the time elapsed between the (n − 1)st and the nth transition.


For completeness, we denote by $X_0$ the initial state, and we let $Y_0 = 0$. We also
introduce some assumptions.


Continuous-Time Markov Chain Assumptions 


• If the current state is $i$, the time until the next transition is exponentially
distributed with a given parameter $\nu_i$, independent of the past
history of the process and of the next state.


• If the current state is $i$, the next state will be $j$ with a given probability
$p_{ij}$, independent of the past history of the process and of the time until
the next transition.





The above assumptions are a complete description of the process and provide
an unambiguous method for simulating it: given that we just entered state
$i$, we remain at state $i$ for a time that is exponentially distributed with parameter
$\nu_i$, and then move to a next state $j$ according to the transition probabilities $p_{ij}$.
As an immediate consequence, the sequence of states $X_n$ obtained after successive
transitions is a discrete-time Markov chain, with transition probabilities $p_{ij}$,
called the embedded Markov chain. In mathematical terms, our assumptions
can be formulated as follows. Let


$$A = \{T_1 = t_1, \ldots, T_n = t_n, \ X_0 = i_0, \ldots, X_{n-1} = i_{n-1}, \ X_n = i\}$$


be an event that captures the history of the process until the nth transition. We
then have


$$\begin{aligned}
P(X_{n+1} = j, \ T_{n+1} \ge t \mid A) &= P(X_{n+1} = j, \ T_{n+1} \ge t \mid X_n = i) \\
&= P(X_{n+1} = j \mid X_n = i)\, P(T_{n+1} \ge t \mid X_n = i) \\
&= p_{ij} e^{-\nu_i t}, \qquad \text{for all } t \ge 0.
\end{aligned}$$


The expected time to the next transition is


$$E[T_{n+1} \mid X_n = i] = \int_0^{\infty} \tau \nu_i e^{-\nu_i \tau}\, d\tau = \frac{1}{\nu_i},$$


so we can interpret $\nu_i$ as the average number of transitions out of state $i$, per unit
time spent at state $i$. Consequently, the parameter $\nu_i$ is called the transition
rate out of state $i$. Since only a fraction $p_{ij}$ of the transitions out of state $i$
will lead to state $j$, we may also view


$$q_{ij} = \nu_i p_{ij}$$


as the average number of transitions from $i$ to $j$, per unit time spent at $i$.
Accordingly, we call $q_{ij}$ the transition rate from $i$ to $j$. Note that given the
transition rates $q_{ij}$, one can obtain the transition rates $\nu_i$ using the formula


$$\nu_i = \sum_{j=1}^{m} q_{ij},$$

and the transition probabilities using the formula


$$p_{ij} = \frac{q_{ij}}{\nu_i}.$$


Note that the model allows for self-transitions, from a state back to itself,
which can indeed happen if a self-transition probability $p_{ii}$ is nonzero. However,
such self-transitions have no observable effects: because of the memorylessness
of the exponential distribution, the remaining time until the next transition is
the same, irrespective of whether a self-transition just occurred or not. For this
reason, we can ignore self-transitions and we will henceforth assume that


$$p_{ii} = q_{ii} = 0, \qquad \text{for all } i.$$


Example 7.14. A machine, once in production mode, operates continuously until 
an alarm signal is generated. The time up to the alarm signal is an exponential 
random variable with parameter 1. Subsequent to the alarm signal, the machine is 
tested for an exponentially distributed amount of time with parameter 5. The test 
results are positive, with probability 1/2, in which case the machine returns to pro- 
duction mode, or negative, with probability 1/2, in which case the machine is taken 
for repair. The duration of the repair is exponentially distributed with parameter 
3. We assume that the above mentioned random variables are all independent and 
also independent of the test results. 

Let states 1, 2, and 3, correspond to production mode, testing, and repair, 
respectively. The transition rates are $\nu_1 = 1$, $\nu_2 = 5$, and $\nu_3 = 3$. The transition
probabilities and the transition rates are given by the following two matrices:


$$P = \begin{bmatrix} 0 & 1 & 0 \\ 1/2 & 0 & 1/2 \\ 1 & 0 & 0 \end{bmatrix}, \qquad Q = \begin{bmatrix} 0 & 1 & 0 \\ 5/2 & 0 & 5/2 \\ 3 & 0 & 0 \end{bmatrix}.$$


See Fig. 7.20 for an illustration.

Figure 7.20: Illustration of the Markov chain in Example 7.14. The quantities
indicated next to each arc are the transition rates $q_{ij}$.
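
The simulation procedure implied by the continuous-time Markov chain assumptions (hold for an exponential time with parameter $\nu_i$, then jump according to $p_{ij}$) can be sketched for Example 7.14 as follows. The function name, the dictionary layout, the seed, and the number of simulated transitions are all assumptions chosen for illustration.

```python
import random

def simulate_ctmc(nu, P, start, num_transitions, rng):
    """Return a list of (state, holding_time) pairs for one sample path."""
    states = list(nu)
    path, state = [], start
    for _ in range(num_transitions):
        # Time spent in the current state: exponential with rate nu[state].
        holding_time = rng.expovariate(nu[state])
        path.append((state, holding_time))
        # Next state drawn from the embedded chain's transition probabilities.
        weights = [P[state].get(j, 0.0) for j in states]
        state = rng.choices(states, weights=weights)[0]
    return path

# Example 7.14: 1 = production, 2 = testing, 3 = repair.
nu = {1: 1.0, 2: 5.0, 3: 3.0}
P = {1: {2: 1.0}, 2: {1: 0.5, 3: 0.5}, 3: {1: 1.0}}

rng = random.Random(0)  # fixed seed so that runs are repeatable
path = simulate_ctmc(nu, P, start=1, num_transitions=30000, rng=rng)

# Empirical mean holding time in each state; should be close to 1 / nu[i].
mean_holding = {}
for i in nu:
    times = [t for state, t in path if state == i]
    mean_holding[i] = sum(times) / len(times)
print(mean_holding)
```

The holding time and the next state are drawn independently, which mirrors the independence stated in the two assumptions.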


We finally note that the continuous-time process we have described has a
Markov property similar to its discrete-time counterpart: the future is independent
of the past, given the present. To see this, denote by $X(t)$ the state of a
continuous-time Markov chain at time $t \ge 0$, and note that it stays constant
between transitions.† Let us recall the memorylessness property of the exponential
distribution, which in our context implies that for any time $t$ between the
nth and (n + 1)st transition times $Y_n$ and $Y_{n+1}$, the additional time $Y_{n+1} - t$
until the next transition is independent of the time $t - Y_n$ that the system has
been in the current state. It follows that for any time $t$, and given the present
state $X(t)$, the future of the process [the random variables $X(\tau)$ for $\tau > t$], is
independent of the past [the random variables $X(\tau)$ for $\tau < t$].


Approximation by a Discrete-Time Markov Chain


We now elaborate on the relation between a continuous-time Markov chain and
a corresponding discrete-time version. This relation will lead to an alternative
description of a continuous-time Markov chain, and to a set of balance equations
characterizing the steady-state behavior.

Let us fix a small positive number $\delta$ and consider the discrete-time Markov
chain $Z_n$ that is obtained by observing $X(t)$ every $\delta$ time units:


$$Z_n = X(n\delta), \qquad n = 0, 1, \ldots$$


The fact that $Z_n$ is a Markov chain (the future is independent from the past,
given the present) follows from the Markov property of $X(t)$. We will use the
notation $\bar{p}_{ij}$ to describe the transition probabilities of $Z_n$.

† If a transition takes place at time $t$, the notation $X(t)$ is ambiguous. A common
convention is to let $X(t)$ refer to the state right after the transition, so that $X(Y_n)$ is
the same as $X_n$.

Given that $Z_n = i$, there is a probability approximately equal to $\nu_i\delta$ that
there is a transition between times $n\delta$ and $(n+1)\delta$, and in that case there is a
further probability $p_{ij}$ that the next state is $j$. Therefore,

$$\bar{p}_{ij} = P(Z_{n+1} = j \mid Z_n = i) = \nu_i p_{ij}\delta + o(\delta) = q_{ij}\delta + o(\delta), \qquad \text{if } j \neq i,$$

where $o(\delta)$ is a term that is negligible compared to $\delta$, as $\delta$ gets smaller. The
probability of remaining at $i$ [i.e., no transition occurs between times $n\delta$ and
$(n+1)\delta$] is

$$\bar{p}_{ii} = P(Z_{n+1} = i \mid Z_n = i) = 1 - \sum_{j \neq i} \bar{p}_{ij}.$$


This gives rise to the following alternative description.‡





Alternative Description of a Continuous-Time Markov Chain 


Given the current state $i$ of a continuous-time Markov chain, and for any
$j \neq i$, the state $\delta$ time units later is equal to $j$ with probability


$$q_{ij}\delta + o(\delta),$$


independent of the past history of the process.





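Example 7.14 (continued). Neglecting $o(\delta)$ terms, the transition probability
matrix for the corresponding discrete-time Markov chain $Z_n$ is


$$\begin{bmatrix} 1 - \delta & \delta & 0 \\ 5\delta/2 & 1 - 5\delta & 5\delta/2 \\ 3\delta & 0 & 1 - 3\delta \end{bmatrix}.$$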

Example 7.15. Queueing. Packets arrive at a node of a communication network
according to a Poisson process with rate $\lambda$. The packets are stored at a buffer with
room for up to $m$ packets, and are then transmitted one at a time. However, if a
packet finds a full buffer upon arrival, it is discarded. The time required to transmit
a packet is exponentially distributed with parameter $\mu$. The transmission times of
different packets are independent and are also independent from all the interarrival
times.

‡ Our argument so far shows that a continuous-time Markov chain satisfies this
alternative description. Conversely, it can be shown that if we start with this alternative
description, the time until a transition out of state $i$ is an exponential random
variable with parameter $\nu_i = \sum_{j} q_{ij}$. Furthermore, given that such a transition has
just occurred, the next state is $j$ with probability $q_{ij}/\nu_i = p_{ij}$. This establishes that
the alternative description is equivalent to the original one.

We will model this system using a continuous-time Markov chain with state
$X(t)$ equal to the number of packets in the system at time $t$ [if $X(t) > 0$, then
$X(t) - 1$ packets are waiting in the queue and one packet is under transmission].
The state increases by one when a new packet arrives and decreases by one when
an existing packet departs. To show that $X(t)$ is indeed a Markov chain, we verify
that we have the property specified in the above alternative description, and at the
same time identify the transition rates $q_{ij}$.

Consider first the case where the system is empty, i.e., the state $X(t)$ is equal
to 0. A transition out of state 0 can only occur if there is a new arrival, in which
case the state becomes equal to 1. Since arrivals are Poisson, we have


$$P(X(t+\delta) = 1 \mid X(t) = 0) = \lambda\delta + o(\delta),$$

and

$$q_{0j} = \begin{cases} \lambda, & \text{if } j = 1, \\ 0, & \text{otherwise.} \end{cases}$$

Consider next the case where the system is full, i.e., the state $X(t)$ is equal
to $m$. A transition out of state $m$ will occur upon the completion of the current
packet transmission, at which point the state will become $m - 1$. Since the duration
of a transmission is exponential (and in particular, memoryless), we have


$$P(X(t+\delta) = m-1 \mid X(t) = m) = \mu\delta + o(\delta),$$

and

$$q_{mj} = \begin{cases} \mu, & \text{if } j = m-1, \\ 0, & \text{otherwise.} \end{cases}$$

Consider finally the case where $X(t)$ is equal to some intermediate state $i$,
with $0 < i < m$. During the next $\delta$ time units, there is a probability $\lambda\delta + o(\delta)$ of a
new packet arrival, which will bring the state to $i + 1$, and a probability $\mu\delta + o(\delta)$
that a packet transmission is completed, which will bring the state to $i - 1$. [The
probability of both an arrival and a departure within an interval of length $\delta$ is of
the order of $\delta^2$ and can be neglected, as is the case with other $o(\delta)$ terms.] Hence,


$$P(X(t+\delta) = i-1 \mid X(t) = i) = \mu\delta + o(\delta),$$
$$P(X(t+\delta) = i+1 \mid X(t) = i) = \lambda\delta + o(\delta),$$

and

$$q_{ij} = \begin{cases} \lambda, & \text{if } j = i+1, \\ \mu, & \text{if } j = i-1, \\ 0, & \text{otherwise,} \end{cases} \qquad \text{for } i = 1, 2, \ldots, m-1;$$

see Fig. 7.21.


Figure 7.21: Transition graph in Example 7.15.
Steady-State Behavior


We now turn our attention to the long-term behavior of a continuous-time
Markov chain and focus on the state occupancy probabilities $P(X(t) = i)$, in
the limit as $t$ gets large. We approach this problem by studying the steady-state
probabilities of the corresponding discrete-time chain $Z_n$.

Since $Z_n = X(n\delta)$, it is clear that the limit $\pi_j$ of $P(Z_n = j \mid Z_0 = i)$,
if it exists, is the same as the limit of $P(X(t) = j \mid X(0) = i)$. It therefore
suffices to consider the steady-state probabilities associated with $Z_n$. Reasoning
as in the discrete-time case, we see that for the steady-state probabilities to be
independent of the initial state, we need the chain $Z_n$ to have a single recurrent
class, which we will henceforth assume. We also note that the Markov chain $Z_n$
is automatically aperiodic. This is because the self-transition probabilities are
of the form

$$\bar{p}_{ii} = 1 - \delta \sum_{j \neq i} q_{ij} + o(\delta),$$

which is positive when $\delta$ is small, and because chains with nonzero self-transition
probabilities are always aperiodic.
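The balance equations for the chain $Z_n$ are of the form


$$\pi_j = \sum_{k=1}^{m} \pi_k \bar{p}_{kj}, \qquad \text{for all } j,$$


or

$$\pi_j = \pi_j \bar{p}_{jj} + \sum_{k \neq j} \pi_k \bar{p}_{kj} = \pi_j \Big(1 - \delta \sum_{k \neq j} q_{jk} + o(\delta)\Big) + \sum_{k \neq j} \pi_k \big(q_{kj}\delta + o(\delta)\big).$$


We cancel out the term $\pi_j$ that appears on both sides of the equation, divide by
$\delta$, and take the limit as $\delta$ decreases to zero, to obtain the balance equations


$$\pi_j \sum_{k \neq j} q_{jk} = \sum_{k \neq j} \pi_k q_{kj}.$$


We can now invoke the Steady-State Convergence Theorem for the chain $Z_n$ to
obtain the following.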





Steady-State Convergence Theorem 


Consider a continuous-time Markov chain with a single recurrent class.
Then, the states $j$ are associated with steady-state probabilities $\pi_j$ that
have the following properties.
(a) For each $j$, we have


$$\lim_{t \to \infty} P(X(t) = j \mid X(0) = i) = \pi_j, \qquad \text{for all } i.$$


(b) The $\pi_j$ are the unique solution to the system of equations below:


$$\pi_j \sum_{k \neq j} q_{jk} = \sum_{k \neq j} \pi_k q_{kj}, \qquad j = 1, \ldots, m,$$

$$1 = \sum_{k=1}^{m} \pi_k.$$

(c) We have

$$\pi_j = 0, \qquad \text{for all transient states } j,$$
$$\pi_j > 0, \qquad \text{for all recurrent states } j.$$





To interpret the balance equations, we view $\pi_j$ as the expected long-term
fraction of time the process spends in state $j$. It follows that $\pi_k q_{kj}$ can be
viewed as the expected frequency of transitions from $k$ to $j$ (expected number
of transitions from $k$ to $j$ per unit time). It is seen therefore that the balance
equations express the intuitive fact that the frequency of transitions out of state
$j$ (the left-hand side term $\pi_j \sum_{k \neq j} q_{jk}$) is equal to the frequency of transitions
into state $j$ (the right-hand side term $\sum_{k \neq j} \pi_k q_{kj}$).


Example 7.14 (continued). The balance and normalization equations for this
example are


$$\pi_1 = \frac{5}{2}\pi_2 + 3\pi_3, \qquad 5\pi_2 = \pi_1, \qquad 3\pi_3 = \frac{5}{2}\pi_2,$$

$$1 = \pi_1 + \pi_2 + \pi_3.$$


As in the discrete-time case, one of these equations is redundant, e.g., the third
equation can be obtained from the first two. Still, there is a unique solution:

$$\pi_1 = \frac{30}{41}, \qquad \pi_2 = \frac{6}{41}, \qquad \pi_3 = \frac{5}{41}.$$

Thus, for example, if we let the process run for a long time, $X(t)$ will be at state 1
with probability 30/41, independent of the initial state.
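
These values can be checked numerically by following the argument of this section: form the discretized chain $Z_n$ (neglecting $o(\delta)$ terms) and iterate its transition matrix until the state probabilities settle. This is only a sketch; the function name, the step δ = 0.1, and the iteration count are assumptions, and a direct solution of the balance equations would work equally well.

```python
def steady_state(Q, delta=0.1, iterations=500):
    """Approximate the steady-state probabilities of a continuous-time chain
    by repeatedly applying the transition matrix of the discretized chain."""
    m = len(Q)
    P_bar = []
    for i in range(m):
        row = [Q[i][j] * delta if j != i else 0.0 for j in range(m)]
        row[i] = 1.0 - sum(row)  # probability of no transition in a step
        P_bar.append(row)
    pi = [1.0 / m] * m  # arbitrary initial distribution
    for _ in range(iterations):
        pi = [sum(pi[k] * P_bar[k][j] for k in range(m)) for j in range(m)]
    return pi

# Transition rates q_ij of Example 7.14.
Q = [[0.0, 1.0, 0.0],
     [2.5, 0.0, 2.5],
     [3.0, 0.0, 0.0]]
pi = steady_state(Q)
print(pi)  # close to [30/41, 6/41, 5/41]
```

Any δ for which all diagonal entries are positive gives the same limit, because the balance equations obtained after dividing by δ do not depend on δ.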

The steady-state probabilities $\pi_j$ are to be distinguished from the steady-state
probabilities $\bar{\pi}_j$ of the embedded Markov chain $X_n$. Indeed, the balance and
normalization equations for the embedded Markov chain are


$$\bar{\pi}_1 = \frac{1}{2}\bar{\pi}_2 + \bar{\pi}_3, \qquad \bar{\pi}_2 = \bar{\pi}_1, \qquad \bar{\pi}_3 = \frac{1}{2}\bar{\pi}_2,$$

$$1 = \bar{\pi}_1 + \bar{\pi}_2 + \bar{\pi}_3,$$

yielding the solution

$$\bar{\pi}_1 = \frac{2}{5}, \qquad \bar{\pi}_2 = \frac{2}{5}, \qquad \bar{\pi}_3 = \frac{1}{5}.$$

To interpret the probabilities $\bar{\pi}_j$, we can say, for example, that if we let the process
run for a long time, the expected fraction of transitions that lead to state 1 is equal
to 2/5.

Note that even though $\bar{\pi}_1 = \bar{\pi}_2$ (that is, there are about as many transitions
into state 1 as there are transitions into state 2), we have $\pi_1 > \pi_2$. The reason is
that the process tends to spend more time during a typical visit to state 1 than
during a typical visit to state 2. Hence, at a given time $t$, the process $X(t)$ is
more likely to be found at state 1. This situation is typical, and the two sets of
steady-state probabilities ($\pi_j$ and $\bar{\pi}_j$) are generically different. The main exception
arises in the special case where the transition rates $\nu_i$ are the same for all $i$; see the
end-of-chapter problems.
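
The link between the two sets of probabilities can be made concrete. A standard relation, not derived in the text above and therefore stated here as an assumption, is that $\pi_j$ is proportional to $\bar{\pi}_j/\nu_j$: the fraction of transitions into $j$ weighted by the mean time $1/\nu_j$ spent per visit. A small sketch with the numbers of Example 7.14:

```python
from fractions import Fraction

# Steady-state probabilities of the embedded chain and the transition
# rates out of each state, as given in Example 7.14.
pi_bar = [Fraction(2, 5), Fraction(2, 5), Fraction(1, 5)]
nu = [1, 5, 3]

# Weight each state by the mean time 1 / nu_j spent per visit, then normalize.
weights = [pb / v for pb, v in zip(pi_bar, nu)]
total = sum(weights)
pi = [w / total for w in weights]

print([str(x) for x in pi])  # ['30/41', '6/41', '5/41']
```

The result agrees with the steady-state probabilities 30/41, 6/41, and 5/41 of the continuous-time chain, and it shows why the two sets coincide when all the $\nu_i$ are equal.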


Birth-Death Processes 


As in the discrete-time case, the states in a birth-death process are linearly 
arranged and transitions can only occur to a neighboring state, or else leave the 
state unchanged; formally, we have 


$$q_{ij} = 0, \qquad \text{for } |i - j| > 1.$$


In a birth-death process, the long-term expected frequencies of transitions from 
i to j and of transitions from j to i must be the same, leading to the local 
balance equations 

$$\pi_j q_{ji} = \pi_i q_{ij}, \qquad \text{for all } i, j.$$


The local balance equations have the same structure as in the discrete-time case, 
leading to closed-form formulas for the steady-state probabilities. 


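Example 7.15 (continued). The local balance equations take the form

$$\pi_i \lambda = \pi_{i+1}\mu, \qquad i = 0, 1, \ldots, m-1,$$

and we obtain $\pi_{i+1} = \rho\pi_i$, where $\rho = \lambda/\mu$. Thus, we have $\pi_i = \rho^i \pi_0$ for all $i$. The
normalization equation $1 = \sum_{i=0}^{m} \pi_i$ yields

$$1 = \pi_0 \sum_{i=0}^{m} \rho^i,$$

and the steady-state probabilities are

$$\pi_i = \frac{\rho^i}{1 + \rho + \cdots + \rho^m}, \qquad i = 0, 1, \ldots, m.$$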

7.6 SUMMARY AND DISCUSSION


In this chapter, we have introduced Markov chain models with a finite number 
of states. In a discrete-time Markov chain, transitions occur at integer times 
according to given transition probabilities $p_{ij}$. The crucial property that distinguishes
Markov chains from general random processes is that the transition
probabilities $p_{ij}$ apply each time that the state is equal to $i$, independent of the
previous values of the state. Thus, given the present, the future of the process 
is independent of the past. 

Coming up with a suitable Markov chain model of a given physical situ- 
ation is to some extent an art. In general, we need to introduce a rich enough 
set of states so that the current state summarizes whatever information from 
the history of the process is relevant to its future evolution. Subject to this 
requirement, we usually aim at a model that does not involve more states than 
necessary. 

Given a Markov chain model, there are several questions of interest. 


(a) Questions referring to the statistics of the process over a finite time horizon. 
We have seen that we can calculate the probability that the process follows a 
particular path by multiplying the transition probabilities along the path. 
The probability of a more general event can be obtained by adding the
probabilities of the various paths that lead to the occurrence of the event. 
In some cases, we can exploit the Markov property to avoid listing each and 
every path that corresponds to a particular event. A prominent example 
is the recursive calculation of the n-step transition probabilities, using the 
Chapman-Kolmogorov equations. 


(b) Questions referring to the steady-state behavior of the Markov chain. To 
address such questions, we classified the states of a Markov chain as tran- 
sient and recurrent. We discussed how the recurrent states can be divided 
into disjoint recurrent classes, so that each state in a recurrent class is ac- 
cessible from every other state in the same class. We also distinguished 
between periodic and aperiodic recurrent classes. The central result of 
Markov chain theory is that if a chain consists of a single aperiodic recurrent
class, plus possibly some transient states, the probability $r_{ij}(n)$ that
the state is equal to some $j$ converges, as time goes to infinity, to a steady-state
probability $\pi_j$, which does not depend on the initial state $i$. In other
words, the identity of the initial state has no bearing on the statistics of
$X_n$ when $n$ is very large. The steady-state probabilities can be found by
solving a system of linear equations, consisting of the balance equations
and the normalization equation $\sum_j \pi_j = 1$.


(c) Questions referring to the transient behavior of a Markov chain. We dis- 
cussed the absorption probabilities (the probability that the state eventu- 
ally enters a given recurrent class, given that it starts at a given transient 
state), and the mean first passage times (the expected time until a particu- 
lar recurrent state is entered, assuming that the chain has a single recurrent 
class). In both cases, we showed that the quantities of interest can be found
by considering the unique solution to a system of linear equations. 


We finally considered continuous-time Markov chains. In such models, 
given the current state, the next state is determined by the same mechanism as 
in discrete-time Markov chains. However, the time until the next transition is an 
exponentially distributed random variable, whose parameter depends only on the
current state. Continuous-time Markov chains are in many ways similar to their 
discrete-time counterparts. They have the same Markov property (the future 
is independent from the past, given the present). In fact, we can visualize a 
continuous-time Markov chain in terms of a related discrete-time Markov chain 
obtained by a fine discretization of the time axis. Because of this correspon- 
dence, the steady-state behaviors of continuous-time and discrete-time Markov 
chains are similar: assuming that there is a single recurrent class, the occupancy 
probability of any particular state converges to a steady-state probability that 
does not depend on the initial state. These steady-state probabilities can be 
found by solving a suitable set of balance and normalization equations. 
PROBLEMS





SECTION 7.1. Discrete-Time Markov Chains


Problem 1. The times between successive customer arrivals at a facility are independent
and identically distributed random variables with the following PMF:


$$p(k) = \begin{cases} 0.2, & k = 1, \\ 0.3, & k = 3, \\ 0.5, & k = 4, \\ 0, & \text{otherwise.} \end{cases}$$


Construct a four-state Markov chain model that describes the arrival process. In this 
model, one of the states should correspond to the times when an arrival occurs. 


Problem 2. A mouse moves along a tiled corridor with $2m$ tiles, where $m > 1$. From
each tile $i \neq 1, 2m$, it moves to either tile $i - 1$ or $i + 1$ with equal probability. From
tile 1 or tile $2m$, it moves to tile 2 or $2m - 1$, respectively, with probability 1. Each
time the mouse moves to a tile $i \le m$ or $i > m$, an electronic device outputs a signal
L or R, respectively. Can the generated sequence of signals L and R be described as a
Markov chain with states L and R?


Problem 3. Consider the Markov chain in Example 7.2, for the case where $m = 4$,
as in Fig. 7.2, and assume that the process starts at any of the four states, with equal
probability. Let $Y_n = 1$ whenever the Markov chain is at state 1 or 2, and $Y_n = 2$
whenever the Markov chain is at state 3 or 4. Is the process $Y_n$ a Markov chain?


SECTION 7.2. Classification of States 


Problem 4. A spider and a fly move along a straight line in unit increments. The 
spider always moves towards the fly by one unit. The fly moves towards the spider by 
one unit with probability 0.3, moves away from the spider by one unit with probability 
0.3, and stays in place with probability 0.4. The initial distance between the spider
and the fly is an integer. When the spider and the fly land in the same position, the
spider captures the fly.


(a) Construct a Markov chain that describes the relative location of the spider and 
fly. 


(b) Identify the transient and recurrent states.


Problem 5. Consider a Markov chain with states 1, 2, ..., 9, and the following
transition probabilities: $p_{12} = p_{17} = 1/2$, $p_{i(i+1)} = 1$ for $i \neq 1, 6, 9$, and
$p_{61} = p_{91} = 1$. Is the recurrent class of the chain periodic or not?
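
One way to explore a question like Problem 5 numerically is to list the times at which a
return to a fixed state is possible and take their greatest common divisor. The sketch
below does this for state 1, using only which transitions have positive probability. The
helper name `return_times` and the 40-step horizon are illustrative choices, not part of
the text.

```python
from math import gcd

# Transition structure of Problem 5 (states 1..9); only the support matters.
edges = {1: [2, 7], 6: [1], 9: [1]}
for i in (2, 3, 4, 5, 7, 8):
    edges[i] = [i + 1]

def return_times(start, horizon):
    """Step counts n <= horizon with a positive-probability path start -> start."""
    times, frontier = [], {start}
    for n in range(1, horizon + 1):
        # States reachable in exactly n steps.
        frontier = {j for i in frontier for j in edges[i]}
        if start in frontier:
            times.append(n)
    return times

times = return_times(1, 40)
period = 0
for n in times:
    period = gcd(period, n)   # gcd(0, n) == n, so this accumulates the gcd
print(times[:5], period)
```

The two loops through state 1 have lengths 6 and 4, so every possible return time is even.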
Problem 6.* Existence of a recurrent state. Show that in a Markov chain at
least one recurrent state must be accessible from any given state, i.e., for any $i$, there
is at least one recurrent $j$ in the set $A(i)$ of accessible states from $i$.


Solution. Fix a state $i$. If $i$ is recurrent, then every $j \in A(i)$ is also recurrent so we
are done. Assume that $i$ is transient. Then, there exists a state $i_1 \in A(i)$ such that
$i \notin A(i_1)$. If $i_1$ is recurrent, then we have found a recurrent state that is accessible from
$i$, and we are done. Suppose now that $i_1$ is transient. Then, $i_1 \neq i$ because otherwise
the assumptions $i_1 \in A(i)$ and $i \notin A(i_1)$ would yield $i \in A(i)$ and $i \notin A(i)$, which is
a contradiction. Since $i_1$ is transient, there exists some $i_2$ such that $i_2 \in A(i_1)$ and
$i_1 \notin A(i_2)$. In particular, $i_2 \in A(i)$. If $i_2$ is recurrent, we are done. So, suppose that $i_2$
is transient. The same argument as before shows that $i_2 \neq i_1$. Furthermore, we must
also have $i_2 \neq i$. This is because if we had $i_2 = i$, we would have $i_1 \in A(i) = A(i_2)$,
contradicting the assumption $i_1 \notin A(i_2)$. Continuing this process, at the $k$th step, we
will either obtain a recurrent state $i_k$ which is accessible from $i$, or else we will obtain
a transient state $i_k$ which is different from all the preceding states $i, i_1, \ldots, i_{k-1}$. Since
there is only a finite number of states, a recurrent state must ultimately be obtained.


Problem 7.* Consider a Markov chain with some transient and some recurrent states.


(a) Show that for some numbers $c$ and $\gamma$, with $c > 0$ and $0 < \gamma < 1$, we have

$$P(X_n \text{ is transient} \mid X_0 = i) \le c\gamma^n, \qquad \text{for all } i \text{ and } n \ge 1.$$


(b) Let $T$ be the first time $n$ at which $X_n$ is recurrent. Show that such a time is
certain to exist (i.e., the probability of the event that there exists a time $n$ at
which $X_n$ is recurrent is equal to 1) and that $E[T] < \infty$.


Solution. (a) For notational convenience, let

$$q_i(n) = P(X_n \text{ transient} \mid X_0 = i).$$

A recurrent state that is accessible from state $i$ can be reached in at most $m$ steps,
where $m$ is the number of states. Therefore, $q_i(m) < 1$. Let

$$\beta = \max_{i=1,\ldots,m} q_i(m)$$

and note that for all $i$, we have $q_i(m) \le \beta < 1$. If a recurrent state has not been reached
by time $m$, which happens with probability at most $\beta$, the conditional probability that
a recurrent state is not reached in the next $m$ steps is at most $\beta$ as well, which suggests
that $q_i(2m) \le \beta^2$. Indeed, conditioning on the possible values of $X_m$, we obtain

$$\begin{aligned}
q_i(2m) &= P(X_{2m} \text{ transient} \mid X_0 = i) \\
&= \sum_{j \text{ transient}} P(X_{2m} \text{ transient} \mid X_m = j, X_0 = i)\, P(X_m = j \mid X_0 = i) \\
&= \sum_{j \text{ transient}} P(X_{2m} \text{ transient} \mid X_m = j)\, P(X_m = j \mid X_0 = i) \\
&= \sum_{j \text{ transient}} P(X_m \text{ transient} \mid X_0 = j)\, P(X_m = j \mid X_0 = i) \\
&\le \beta \sum_{j \text{ transient}} P(X_m = j \mid X_0 = i) \\
&= \beta\, P(X_m \text{ transient} \mid X_0 = i) \\
&\le \beta^2.
\end{aligned}$$

Continuing similarly, we obtain

$$q_i(km) \le \beta^k, \qquad \text{for all } i \text{ and } k \ge 1.$$

Let $n$ be any positive integer, and let $k$ be the integer such that $km \le n <
(k+1)m$. Then, we have

$$q_i(n) \le q_i(km) \le \beta^k = \beta^{-1} \big(\beta^{1/m}\big)^{(k+1)m} \le \beta^{-1} \big(\beta^{1/m}\big)^{n}.$$

Thus, the desired relation holds with $c = \beta^{-1}$ and $\gamma = \beta^{1/m}$.


(b) Let $A$ be the event that the state never enters the set of recurrent states. Using
the result from part (a), we have

$$P(A) \le P(X_n \text{ transient}) \le c\gamma^n.$$

Since this is true for every $n$ and since $\gamma < 1$, we must have $P(A) = 0$. This establishes
that there is certainty (probability equal to 1) that there is a finite time $T$ that a
recurrent state is first entered. We then have

$$\begin{aligned}
E[T] &= \sum_{n=1}^{\infty} n\, P(X_{n-1} \text{ transient}, X_n \text{ recurrent}) \\
&\le \sum_{n=1}^{\infty} n\, P(X_{n-1} \text{ transient}) \\
&\le \sum_{n=1}^{\infty} n c \gamma^{n-1} \\
&= \frac{c}{1-\gamma} \sum_{n=1}^{\infty} n (1-\gamma) \gamma^{n-1} \\
&= \frac{c}{(1-\gamma)^2},
\end{aligned}$$

where the last equality is obtained using the expression for the mean of the geometric
distribution.
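
The bounds of Problem 7 can be checked numerically on a small example. The sketch below
uses a hypothetical 3-state chain (two transient states and one absorbing state, chosen
only for illustration), computes $q_i(n)$ by propagating the state PMF, and compares it
with $c\gamma^n$ for $c = \beta^{-1}$ and $\gamma = \beta^{1/m}$. It also compares $E[T]$ with the
bound $c/(1-\gamma)^2$.

```python
# Hypothetical 3-state chain: states 0 and 1 are transient, state 2 is absorbing.
P = [[0.5, 0.3, 0.2],
     [0.4, 0.4, 0.2],
     [0.0, 0.0, 1.0]]
transient = [0, 1]
m = len(P)

def q(i, n):
    """q_i(n) = P(X_n transient | X_0 = i), by propagating the state PMF n steps."""
    dist = [1.0 if s == i else 0.0 for s in range(m)]
    for _ in range(n):
        dist = [sum(dist[a] * P[a][b] for a in range(m)) for b in range(m)]
    return sum(dist[s] for s in transient)

beta = max(q(i, m) for i in range(m))
c, gamma = 1.0 / beta, beta ** (1.0 / m)

# Check q_i(n) <= c * gamma^n over a range of n (small slack for rounding).
bound_holds = all(q(i, n) <= c * gamma ** n + 1e-12
                  for i in range(m) for n in range(1, 60))

# Starting from transient state 0, E[T] = sum_{n>=0} P(T > n) = sum_{n>=0} q_0(n).
# The sum is truncated at 400 terms; the remaining tail is negligible for this chain.
expected_T = sum(q(0, n) for n in range(400))
print(bound_holds, expected_T, c / (1 - gamma) ** 2)
```

In this example each transient state leaves the transient set with probability 0.2 per
step, so $E[T] = 5$, well below the (deliberately loose) bound.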





Problem 8.* Recurrent states. Show that if a recurrent state is visited once, the 
probability that it will be visited again in the future is equal to 1 (and, therefore, the 
probability that it will be visited an infinite number of times is equal to 1). Hint: 
Modify the chain to make the recurrent state of interest the only recurrent state, and 
use the result of Problem 7(b). 


Solution. Let $s$ be a recurrent state, and suppose that $s$ has been visited once. From
then on, the only possible states are those in the same recurrent class as $s$. Therefore,
without loss of generality, we can assume that there is a single recurrent class. Suppose
that the current state is some $i \neq s$. We want to show that $s$ is guaranteed to be visited
some time in the future.

Consider a new Markov chain in which the transitions out of state $s$ are disabled,
so that $p_{ss} = 1$. The transitions out of states $i$, for $i \neq s$, are unaffected. Clearly, $s$
is recurrent in the new chain. Furthermore, for any state $i \neq s$, there is a positive
probability path from $i$ to $s$ in the original chain (since $s$ is recurrent in the original
chain), and the same holds true in the new chain. Since $i$ is not accessible from $s$ in
the new chain, it follows that every $i \neq s$ in the new chain is transient. By the result
of Problem 7(b), state $s$ will be eventually visited by the new chain (with probability
1). But the original chain is identical to the new one until the time that $s$ is first
visited. Hence, state $s$ is guaranteed to be eventually visited by the original chain.
By repeating this argument, we see that $s$ is guaranteed to be visited an infinite number
of times (with probability 1).


Problem 9.* Periodic classes. Consider a recurrent class R. Show that exactly 
one of the following two alternatives must hold: 


(i) The states in $R$ can be grouped in $d > 1$ disjoint subsets $S_1, \ldots, S_d$, so that all
transitions from $S_k$ lead to $S_{k+1}$, or to $S_1$ if $k = d$. (In this case, $R$ is periodic.)


(ii) There exists a time $n$ such that $r_{ij}(n) > 0$ for all $i, j \in R$. (In this case $R$ is
aperiodic.)


Hint: Fix a state $i$ and let $d$ be the greatest common divisor of the elements of the set
$Q = \{n \mid r_{ii}(n) > 0\}$. If $d = 1$, use the following fact from elementary number theory: if
the positive integers $\alpha_1, \alpha_2, \ldots$ have no common divisor other than 1, then every positive
integer $n$ outside a finite set can be expressed in the form $n = k_1\alpha_1 + k_2\alpha_2 + \cdots + k_t\alpha_t$
for some nonnegative integers $k_1, \ldots, k_t$, and some $t \ge 1$.


Solution. Fix a state $i$ and consider the set $Q = \{n \mid r_{ii}(n) > 0\}$. Let $d$ be the
greatest common divisor of the elements of $Q$. Consider first the case where $d \neq 1$.
For $k = 1, \ldots, d$, let $S_k$ be the set of all states that are reachable from $i$ in $ld + k$
steps for some nonnegative integer $l$. Suppose that $s \in S_k$ and $p_{ss'} > 0$. Since $s \in S_k$,
we can reach $s$ from $i$ in $ld + k$ steps for some $l$, which implies that we can reach $s'$
from $i$ in $ld + k + 1$ steps. This shows that $s' \in S_{k+1}$ if $k < d$, and that $s' \in S_1$ if
$k = d$. It only remains to show that the sets $S_1, \ldots, S_d$ are disjoint. Suppose, to derive
a contradiction, that $s \in S_k$ and $s \in S_{k'}$ for some $k \neq k'$. Let $q$ be the length of some
positive probability path from $s$ to $i$. Starting from $i$, we can get to $s$ in $ld + k$ steps,
and then return to $i$ in $q$ steps. Hence $ld + k + q$ belongs to $Q$, which implies that $d$
divides $k + q$. By the same argument, $d$ must also divide $k' + q$. Hence $d$ divides $k - k'$,
which is a contradiction because $1 \le |k - k'| \le d - 1$.

Consider next the case where $d = 1$. Let $Q = \{\alpha_1, \alpha_2, \ldots\}$. Since these are the
possible lengths of positive probability paths that start and end at $i$, it follows that
any integer $n$ of the form $n = k_1\alpha_1 + k_2\alpha_2 + \cdots + k_t\alpha_t$ is also in $Q$. (To see this, use
$k_1$ times a path of length $\alpha_1$, followed by using $k_2$ times a path of length $\alpha_2$, etc.) By
the number-theoretic fact given in the hint, the set $Q$ contains all but finitely many
positive integers. Let $n_i$ be such that

$$r_{ii}(n) > 0, \qquad \text{for all } n > n_i.$$

Fix some $j \neq i$ and let $q$ be the length of a shortest positive probability path from $i$
to $j$, so that $q < m$, where $m$ is the number of states. Consider some $n$ that satisfies
$n > n_i + m$, and note that $n - q > n_i + m - q > n_i$. Thus, we can go from $i$ to itself in
$n - q$ steps, and then from $i$ to $j$ in $q$ steps. Therefore, there is a positive probability
path from $i$ to $j$, of length $n$, so that $r_{ij}(n) > 0$. Since the starting state $i$ was arbitrary,
any $n$ that exceeds $n_i + m$ for every $i \in R$ works for all pairs $i, j \in R$.

We have so far established that at least one of the alternatives given in the
problem statement must hold. To establish that they cannot hold simultaneously, note
that the first alternative implies that $r_{ii}(n) = 0$ whenever $n$ is not an integer multiple
of $d$, which is incompatible with the second alternative.

For completeness, we now provide a proof of the number-theoretic fact that was
used in this problem. We start with the set of positive integers $\alpha_1, \alpha_2, \ldots$, and assume
that they have no common divisor other than 1. We define $M$ as the set of all positive
integers of the form $\sum_{i=1}^{t} k_i\alpha_i$, where the $k_i$ are nonnegative integers. Note that this set
is closed under addition (the sum of two elements of $M$ is of the same form and must
also belong to $M$). Let $g$ be the smallest difference between two distinct elements of
$M$. Then, $g \ge 1$ and $g \le \alpha_i$ for all $i$, since $\alpha_i$ and $2\alpha_i$ both belong to $M$.

Suppose that $g > 1$. Since the greatest common divisor of the $\alpha_i$ is 1, there exists
some $\alpha_{i^*}$ which is not divisible by $g$. We then have

$$\alpha_{i^*} = \ell g + r,$$

for some positive integer $\ell$, where the remainder $r$ satisfies $0 < r < g$. Furthermore, in
view of the definition of $g$, there exist nonnegative integers $k_1, k_1', \ldots, k_t, k_t'$ such that

$$\sum_{i=1}^{t} k_i\alpha_i = \sum_{i=1}^{t} k_i'\alpha_i + g.$$

Multiplying this equation by $\ell$ and using the equation $\alpha_{i^*} = \ell g + r$, we obtain

$$\sum_{i=1}^{t} (\ell k_i)\alpha_i = \sum_{i=1}^{t} (\ell k_i')\alpha_i + \ell g = \sum_{i=1}^{t} (\ell k_i')\alpha_i + \alpha_{i^*} - r.$$

This shows that there exist two numbers in the set $M$, whose difference is equal to $r$.
Since $0 < r < g$, this contradicts our definition of $g$ as the smallest possible difference.
This contradiction establishes that $g$ must be equal to 1.

Since $g = 1$, there exists some positive integer $x$ such that $x \in M$ and $x + 1 \in M$.
We will now show that every integer $n$ larger than $\alpha_1 x$ belongs to $M$. Indeed, by
dividing $n$ by $\alpha_1$, we obtain $n = k\alpha_1 + r$, where $k \ge x$ and where the remainder $r$
satisfies $0 \le r < \alpha_1$. We rewrite $n$ in the form

$$n = x(\alpha_1 - r) + (x + 1)r + (k - x)\alpha_1.$$

Since $x$, $x + 1$, and $\alpha_1$ all belong to $M$, this shows that $n$ is the sum of elements of $M$
and must also belong to $M$, as desired.
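
The number-theoretic fact is easy to observe on a concrete case. The following sketch
marks which integers up to a limit can be written as nonnegative integer combinations of
a given set, and lists the exceptions. The set {6, 10, 15} is a hypothetical example,
chosen because its elements have no common divisor other than 1 even though every pair
shares a factor. The function name `representable` is likewise an illustrative choice.

```python
from math import gcd
from functools import reduce

def representable(alphas, limit):
    """ok[n] is True when n = sum k_i * alpha_i for some nonnegative integers k_i."""
    ok = [False] * (limit + 1)
    ok[0] = True  # the empty combination
    for n in range(1, limit + 1):
        # n is representable if removing one alpha leaves a representable number.
        ok[n] = any(n >= a and ok[n - a] for a in alphas)
    return ok

alphas = [6, 10, 15]
assert reduce(gcd, alphas) == 1

ok = representable(alphas, 200)
gaps = [n for n in range(1, 201) if not ok[n]]
print(gaps)
```

Only finitely many positive integers are missing, the largest being 29. From 30 onwards
every integer up to the limit is representable, as the fact predicts.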


SECTION 7.3. Steady-State Behavior 


Problem 10. Consider the two models of machine failure and repair in Example 7.3. 
Find conditions on b and r for the chain to have a single recurrent class which is 
aperiodic and, under those conditions, find closed form expressions for the steady-state
probabilities. 


Problem 11. A professor gives tests that are hard, medium, or easy. If she gives a 
hard test, her next test will be either medium or easy, with equal probability. However, 
if she gives a medium or easy test, there is a 0.5 probability that her next test will be of 
the same difficulty, and a 0.25 probability for each of the other two levels of difficulty. 
Construct an appropriate Markov chain and find the steady-state probabilities. 
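
As a numerical companion to Problem 11, the sketch below encodes one natural choice of
chain (states hard, medium, easy, in that order) and approximates the steady-state
probabilities by repeatedly applying $\pi \leftarrow \pi P$. The state ordering, the function name
`steady_state`, and the iteration count are assumptions made for illustration. Solving
the balance and normalization equations by hand is the intended method.

```python
# States: 0 = hard, 1 = medium, 2 = easy (transition probabilities from the problem statement).
P = [[0.00, 0.50, 0.50],
     [0.25, 0.50, 0.25],
     [0.25, 0.25, 0.50]]

def steady_state(P, iterations=200):
    """Approximate the steady-state PMF by repeatedly applying pi <- pi P."""
    m = len(P)
    pi = [1.0 / m] * m
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]
    return pi

pi = steady_state(P)
print([round(x, 4) for x in pi])
```

The iteration settles at (0.2, 0.4, 0.4), which can be confirmed by substituting into the
balance equations.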


Problem 12. Alvin likes to sail each Saturday to his cottage on a nearby island off 
the coast. Alvin is an avid fisherman, and enjoys fishing off his boat on the way to and 
from the island, as long as the weather is good. Unfortunately, the weather is good on 
the way to or from the island with probability p, independent of what the weather was 
on any past trip (so the weather could be nice on the way to the island, but poor on 
the way back). Now, if the weather is nice, Alvin will take one of his n fishing rods for 
the trip, but if the weather is bad, he will not bring a fishing rod with him. We want 
to find the probability that on a given leg of the trip to or from the island the weather 
will be nice, but Alvin will not fish because all his fishing rods are at his other home. 


(a) Formulate an appropriate Markov chain model with n + 1 states and find the 
steady-state probabilities. 


(b) What is the steady-state probability that on a given trip, Alvin sails with nice 
weather but without a fishing rod? 


Problem 13. Consider the Markov chain in Fig. 7.22. Let us refer to a transition 
that results in a state with a higher (respectively, lower) index as a birth (respectively, 
death). Calculate the following quantities, assuming that when we start observing the 
chain, it is already in steady-state. 





Figure 7.22: Transition probability graph for Problem 13.


(a) For each state i, the probability that the current state is i. 
(b) The probability that the first transition we observe is a birth. 
(c) The probability that the first change of state we observe is a birth. 


(d) The conditional probability that the process was in state 2 before the first tran- 
sition that we observe, given that this transition was a birth. 


(e) The conditional probability that the process was in state 2 before the first change 
of state that we observe, given that this change of state was a birth. 
(f) The conditional probability that the first observed transition is a birth given that
it resulted in a change of state. 


(g) The conditional probability that the first observed transition leads to state 2, 
given that it resulted in a change of state. 


Problem 14. Consider a Markov chain with given transition probabilities and with a 
single recurrent class that is aperiodic. Assume that for n > 500, the n-step transition 
probabilities are very close to the steady-state probabilities. 


(a) Find an approximate formula for $P(X_{1000} = j, X_{1001} = k, X_{2000} = l \mid X_0 = i)$.


(b) Find an approximate formula for $P(X_{1000} = i \mid X_{1001} = j)$.


Problem 15. Ehrenfest model of diffusion. We have a total of n balls, some of 
them black, some white. At each time step, we either do nothing, which happens with
probability $\epsilon$, where $0 < \epsilon < 1$, or we select a ball at random, so that each ball has
probability $(1 - \epsilon)/n > 0$ of being selected. In the latter case, we change the color of
the selected ball (if white it becomes black, and vice versa), and the process is repeated 
indefinitely. What is the steady-state distribution of the number of white balls? 
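
A candidate answer to Problem 15 can be tested against the balance equations before
attempting a proof. The sketch below builds the transition matrix on the number of white
balls and checks that the binomial PMF with parameters $n$ and $1/2$ satisfies
$\pi_j = \sum_i \pi_i p_{ij}$. The values of `n` and `eps` are arbitrary illustrative choices, and
the binomial form is a guess being checked, not a derivation.

```python
from math import comb

def ehrenfest_matrix(n, eps):
    """Transition matrix on the number of white balls, 0..n."""
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        P[i][i] = eps                               # nothing happens
        if i > 0:
            P[i][i - 1] = (1 - eps) * i / n         # a white ball is selected
        if i < n:
            P[i][i + 1] = (1 - eps) * (n - i) / n   # a black ball is selected
    return P

n, eps = 6, 0.3   # hypothetical parameter values
P = ehrenfest_matrix(n, eps)
binom = [comb(n, i) / 2 ** n for i in range(n + 1)]

# Balance check: sum_i pi_i p_ij should reproduce pi_j for the binomial candidate.
residual = max(abs(sum(binom[i] * P[i][j] for i in range(n + 1)) - binom[j])
               for j in range(n + 1))
print(residual)
```

A residual at the level of rounding error indicates that the candidate satisfies the
balance equations for these parameter values.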


Problem 16. Bernoulli-Laplace model of diffusion. Each of two urns contains 
m balls. Out of the total of the 2m balls, m are white and m are black. A ball is 
simultaneously selected from each urn and moved to the other urn, and the process 
is indefinitely repeated. What is the steady-state distribution of the number of white 
balls in each urn? 


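Problem 17. Consider a Markov chain with two states denoted 1 and 2, and transition
probabilities

$$p_{11} = 1 - \alpha, \qquad p_{12} = \alpha,$$
$$p_{21} = \beta, \qquad p_{22} = 1 - \beta,$$

where $\alpha$ and $\beta$ are such that $0 < \alpha < 1$ and $0 < \beta < 1$.

(a) Show that the two states of the chain form a recurrent and aperiodic class.


(b) Use induction to show that for all $n$, we have

$$r_{11}(n) = \frac{\beta}{\alpha + \beta} + \frac{\alpha(1 - \alpha - \beta)^n}{\alpha + \beta}, \qquad
r_{12}(n) = \frac{\alpha}{\alpha + \beta} - \frac{\alpha(1 - \alpha - \beta)^n}{\alpha + \beta},$$

$$r_{21}(n) = \frac{\beta}{\alpha + \beta} - \frac{\beta(1 - \alpha - \beta)^n}{\alpha + \beta}, \qquad
r_{22}(n) = \frac{\alpha}{\alpha + \beta} + \frac{\beta(1 - \alpha - \beta)^n}{\alpha + \beta}.$$


(c) What are the steady-state probabilities $\pi_1$ and $\pi_2$?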
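
Before attempting the induction in Problem 17(b), it can be reassuring to compare the
closed-form expressions with direct matrix multiplication. The sketch below does so for
one hypothetical pair of parameter values. The function names `n_step` and `closed_form`
are illustrative.

```python
def n_step(alpha, beta, n):
    """n-step transition matrix of the two-state chain by repeated multiplication."""
    P = [[1 - alpha, alpha], [beta, 1 - beta]]
    R = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(n):
        R = [[sum(R[i][k] * P[k][j] for k in range(2)) for j in range(2)]
             for i in range(2)]
    return R

def closed_form(alpha, beta, n):
    """Closed-form n-step probabilities with the (1 - alpha - beta)^n correction term."""
    s, lam = alpha + beta, (1 - alpha - beta) ** n
    return [[(beta + alpha * lam) / s, (alpha - alpha * lam) / s],
            [(beta - beta * lam) / s, (alpha + beta * lam) / s]]

alpha, beta = 0.3, 0.6   # hypothetical values
max_err = max(abs(n_step(alpha, beta, n)[i][j] - closed_form(alpha, beta, n)[i][j])
              for n in range(1, 25) for i in range(2) for j in range(2))
print(max_err)
```

Because $|1 - \alpha - \beta| < 1$, the correction term vanishes as $n$ grows, which is the
observation needed for part (c).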


Problem 18. The parking garage at MIT has installed a card-operated gate, which, 
unfortunately, is vulnerable to absent-minded faculty and staff. In particular, in each 
day, a car crashes the gate with probability p, in which case a new gate must be 
installed. Also a gate that has survived for m days must be replaced as a matter of 
periodic maintenance. What is the long-term expected frequency of gate replacements? 
Problem 19.* Steady-state convergence. Consider a Markov chain with a single
recurrent class, and assume that there exists a time $\bar{n}$ such that

$$r_{ij}(\bar{n}) > 0,$$

for all $i$ and all recurrent $j$. (This is equivalent to assuming that the class is aperiodic.)
We wish to show that for any $i$ and $j$, the limit

$$\lim_{n \to \infty} r_{ij}(n)$$

exists and does not depend on $i$. To derive this result, we need to show that the choice of
the initial state has no long-term effect. To quantify this effect, we consider two different
initial states $i$ and $k$, and consider two independent Markov chains, $X_n$ and $Y_n$, with
the same transition probabilities and with $X_0 = i$, $Y_0 = k$. Let $T = \min\{n \mid X_n = Y_n\}$
be the first time that the two chains enter the same state.


(a) Show that there exist positive constants $c$ and $\gamma < 1$ such that

$$P(T \ge n) \le c\gamma^n.$$


(b) Show that if the states of the two chains became equal by time $n$, their occupancy
probabilities at time $n$ are the same, that is,

$$P(X_n = j \mid T \le n) = P(Y_n = j \mid T \le n).$$


(c) Show that $|r_{ij}(n) - r_{kj}(n)| \le c\gamma^n$, for all $i$, $j$, $k$, and $n$. Hint: Condition on the
two events $\{T > n\}$ and $\{T \le n\}$.


(d) Let $q_j^+(n) = \max_i r_{ij}(n)$ and $q_j^-(n) = \min_i r_{ij}(n)$. Show that

$$q_j^-(n) \le q_j^-(n+1) \le q_j^+(n+1) \le q_j^+(n), \qquad \text{for all } n.$$


(e) Show that the sequence $r_{ij}(n)$ converges to a limit that does not depend on $i$.
Hint: Combine the results of parts (c) and (d) to show that the two sequences
$q_j^-(n)$ and $q_j^+(n)$ converge and have the same limit.


Solution. (a) The argument is similar to the one used to bound the PMF of the time
until a recurrent state is entered (Problem 7). Let $l$ be some recurrent state and let
$\beta = \min_i r_{il}(\bar{n}) > 0$. No matter what is the current state of $X_n$ and $Y_n$, there is
probability of at least $\beta^2$ that both chains are at state $l$ after $\bar{n}$ time steps. Thus,

$$P(T > \bar{n}) \le 1 - \beta^2.$$

Similarly,

$$P(T > 2\bar{n}) = P(T > \bar{n})\, P(T > 2\bar{n} \mid T > \bar{n}) \le (1 - \beta^2)^2,$$

and

$$P(T > k\bar{n}) \le (1 - \beta^2)^k.$$

This implies that

$$P(T \ge n) \le c\gamma^n,$$

where $\gamma = (1 - \beta^2)^{1/\bar{n}}$, and $c = 1/(1 - \beta^2)$.


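(b) We condition on the possible values of $T$ and on the common state $l$ of the two
chains at time $T$, and use the total probability theorem. We have

$$\begin{aligned}
P(X_n = j \mid T \le n) &= \sum_{t=0}^{n} \sum_{l=1}^{m} P(X_n = j \mid T = t, X_t = l)\, P(T = t, X_t = l \mid T \le n) \\
&= \sum_{t=0}^{n} \sum_{l=1}^{m} P(X_n = j \mid X_t = l)\, P(T = t, X_t = l \mid T \le n) \\
&= \sum_{t=0}^{n} \sum_{l=1}^{m} r_{lj}(n - t)\, P(T = t, X_t = l \mid T \le n).
\end{aligned}$$

Similarly,

$$P(Y_n = j \mid T \le n) = \sum_{t=0}^{n} \sum_{l=1}^{m} r_{lj}(n - t)\, P(T = t, Y_t = l \mid T \le n).$$

But the events $\{T = t, X_t = l\}$ and $\{T = t, Y_t = l\}$ are identical, and therefore have
the same probability, which implies that $P(X_n = j \mid T \le n) = P(Y_n = j \mid T \le n)$.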


(c) We have

$$r_{ij}(n) = P(X_n = j) = P(X_n = j \mid T \le n)\, P(T \le n) + P(X_n = j \mid T > n)\, P(T > n)$$

and

$$r_{kj}(n) = P(Y_n = j) = P(Y_n = j \mid T \le n)\, P(T \le n) + P(Y_n = j \mid T > n)\, P(T > n).$$

By subtracting these two equations, using the result of part (b) to eliminate the first
terms in their right-hand sides, and by taking the absolute value of both sides, we
obtain

$$\begin{aligned}
|r_{ij}(n) - r_{kj}(n)| &\le \big| P(X_n = j \mid T > n)\, P(T > n) - P(Y_n = j \mid T > n)\, P(T > n) \big| \\
&\le P(T > n) \\
&\le c\gamma^n.
\end{aligned}$$


(d) By conditioning on the state after the first transition, and using the total probability
theorem, we have the following variant of the Chapman-Kolmogorov equation:

$$r_{ij}(n + 1) = \sum_{k=1}^{m} p_{ik}\, r_{kj}(n).$$

Using this equation, we obtain

$$q_j^+(n + 1) = \max_i r_{ij}(n + 1) = \max_i \sum_{k=1}^{m} p_{ik}\, r_{kj}(n) \le \max_i \sum_{k=1}^{m} p_{ik}\, q_j^+(n) = q_j^+(n).$$

The inequality $q_j^-(n) \le q_j^-(n + 1)$ is established by a symmetrical argument. The
inequality $q_j^-(n + 1) \le q_j^+(n + 1)$ is a consequence of the definitions.


(e) The sequences $q_j^-(n)$ and $q_j^+(n)$ converge because they are monotonic and bounded.
The inequality $|r_{ij}(n) - r_{kj}(n)| \le c\gamma^n$, for all $i$ and $k$, implies that
$q_j^+(n) - q_j^-(n) \le c\gamma^n$. Taking the limit as $n \to \infty$, we obtain that the limits of
$q_j^+(n)$ and $q_j^-(n)$ are the same. Let $\pi_j$ denote this common limit. Since
$q_j^-(n) \le r_{ij}(n) \le q_j^+(n)$, it follows that $r_{ij}(n)$ also converges to $\pi_j$, and the limit
is independent of $i$.
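
The squeeze argument of Problem 19 can be watched in action. The sketch below tracks, for
one column $j$ of the $n$-step transition matrix, the largest and smallest entries
$q_j^+(n)$ and $q_j^-(n)$. The 3-state matrix is a hypothetical aperiodic chain with a
single recurrent class, chosen only for illustration.

```python
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]   # hypothetical aperiodic chain with a single recurrent class
m = len(P)

def step(R):
    """One more transition: r_ij(n+1) = sum_k p_ik r_kj(n)."""
    return [[sum(P[i][k] * R[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]

R = [row[:] for row in P]   # r_ij(1)
j = 0                       # column being tracked
q_plus, q_minus = [], []
for n in range(1, 40):
    col = [R[i][j] for i in range(m)]
    q_plus.append(max(col))    # q_j^+(n)
    q_minus.append(min(col))   # q_j^-(n)
    R = step(R)

gaps = [hi - lo for hi, lo in zip(q_plus, q_minus)]
print(gaps[0], gaps[-1])
```

The upper sequence never increases, the lower sequence never decreases, and the gap
between them shrinks geometrically, which is exactly what parts (c) and (d) assert.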


Problem 20.* Uniqueness of solutions to the balance equations. Consider a 
Markov chain with a single recurrent class, plus possibly some transient states. 


(a) Assuming that the recurrent class is aperiodic, show that the balance equations 
together with the normalization equation have a unique nonnegative solution. 
Hint: Given a solution different from the steady-state probabilities, let it be the 
PMF of Xo and consider what happens as time goes to infinity. 


(b) Show that the uniqueness result of part (a) is also true when the recurrent class is 
periodic. Hint: Introduce self-transitions in the Markov chain, in a manner that 
results in an equivalent set of balance equations, and use the result of part (a). 


Solution. (a) Let $\pi_1, \ldots, \pi_m$ be the steady-state probabilities, that is, the limits of the
$r_{ij}(n)$. These satisfy the balance and normalization equations. Suppose that there is
another nonnegative solution $\bar{\pi}_1, \ldots, \bar{\pi}_m$. Let us initialize the Markov chain according
to these probabilities, so that $P(X_0 = j) = \bar{\pi}_j$ for all $j$. Using the argument given in
the text, we obtain $P(X_n = j) = \bar{\pi}_j$, for all times. Thus,

$$\begin{aligned}
\bar{\pi}_j &= \lim_{n \to \infty} P(X_n = j) \\
&= \lim_{n \to \infty} \sum_{k=1}^{m} \bar{\pi}_k\, r_{kj}(n) \\
&= \sum_{k=1}^{m} \bar{\pi}_k\, \pi_j \\
&= \pi_j.
\end{aligned}$$


(b) Consider a new Markov chain, whose transition probabilities $\bar{p}_{ij}$ are given by

$$\bar{p}_{ii} = (1 - \alpha)p_{ii} + \alpha, \qquad \bar{p}_{ij} = (1 - \alpha)p_{ij}, \quad j \neq i.$$

Here, $\alpha$ is a number satisfying $0 < \alpha < 1$. The balance equations for the new Markov
chain take the form

$$\pi_j = \pi_j\big((1 - \alpha)p_{jj} + \alpha\big) + \sum_{i \neq j} \pi_i (1 - \alpha) p_{ij},$$

or

$$(1 - \alpha)\pi_j = (1 - \alpha) \sum_{i=1}^{m} \pi_i p_{ij}.$$

These equations are equivalent to the balance equations for the original chain. Notice
that the new chain is aperiodic, because self-transitions have positive probability. This
establishes uniqueness of solutions for the new chain, and implies the same for the
original chain.
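
The self-transition device of Problem 20(b) can be illustrated on a small periodic chain.
In the sketch below, the 3-state matrix `P` (period 2) and the weight `a` playing the role
of $\alpha$ are hypothetical choices. The modified chain converges, the original one
oscillates, and the limit of the modified chain satisfies the original balance equations.

```python
P = [[0.0, 1.0, 0.0],
     [0.5, 0.0, 0.5],
     [0.0, 1.0, 0.0]]   # hypothetical chain with period 2
a = 0.5                 # self-transition weight alpha, any value in (0, 1)
m = len(P)

# Modified chain: (1 - a) p_ij off the diagonal, (1 - a) p_ii + a on the diagonal.
lazy = [[(1 - a) * P[i][j] + (a if i == j else 0.0) for j in range(m)]
        for i in range(m)]

def propagate(M, dist, n):
    """Apply dist <- dist M a total of n times."""
    for _ in range(n):
        dist = [sum(dist[i] * M[i][j] for i in range(m)) for j in range(m)]
    return dist

start = [1.0, 0.0, 0.0]
pi_lazy = propagate(lazy, start, 200)   # converges: the modified chain is aperiodic
even = propagate(P, start, 200)         # the original chain keeps alternating
odd = propagate(P, start, 201)

# The limit of the modified chain should satisfy the ORIGINAL balance equations.
balance_gap = max(abs(sum(pi_lazy[i] * P[i][j] for i in range(m)) - pi_lazy[j])
                  for j in range(m))
print(pi_lazy, even, odd, balance_gap)
```

Here the limit is (0.25, 0.5, 0.25), while the occupancy probabilities of the original
chain alternate between (0.5, 0, 0.5) and (0, 1, 0) and never settle.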


Problem 21.* Expected long-term frequency interpretation. Consider a Markov
chain with a single recurrent class which is aperiodic. Show that

$$\pi_j = \lim_{n \to \infty} \frac{v_{ij}(n)}{n}, \qquad \text{for all } i, j = 1, \ldots, m,$$

where the $\pi_j$ are the steady-state probabilities, and $v_{ij}(n)$ is the expected value of the
number of visits to state $j$ within the first $n$ transitions, starting from state $i$. Hint:
Use the following fact from analysis. If a sequence $a_n$ converges to a number $a$, the
sequence $b_n$ defined by $b_n = (1/n) \sum_{k=1}^{n} a_k$ also converges to $a$.


Solution. We first assert that for all $n$, $i$, and $j$, we have

$$v_{ij}(n) = \sum_{k=1}^{n} r_{ij}(k).$$

To see this, note that

$$v_{ij}(n) = E\left[ \sum_{k=1}^{n} I_k \,\Big|\, X_0 = i \right],$$

where $I_k$ is the random variable that takes the value 1 if $X_k = j$, and the value 0
otherwise, so that

$$E[I_k \mid X_0 = i] = r_{ij}(k).$$

Since

$$\frac{v_{ij}(n)}{n} = \frac{1}{n} \sum_{k=1}^{n} r_{ij}(k)$$

and $r_{ij}(k)$ converges to $\pi_j$, it follows that $v_{ij}(n)/n$ also converges to $\pi_j$, which is the
desired result.

For completeness, we also provide the proof of the fact given in the hint, and
which was used in the last step of the above argument. Consider a sequence $a_n$ that
converges to some $a$, and let $b_n = (1/n) \sum_{k=1}^{n} a_k$. Fix some $\epsilon > 0$. Since $a_n$ converges
to $a$, there exists some $n_0$ such that $a_k \le a + (\epsilon/2)$, for all $k > n_0$. Let also $c = \max_k a_k$.
We then have

$$b_n = \frac{1}{n} \sum_{k=1}^{n_0} a_k + \frac{1}{n} \sum_{k=n_0+1}^{n} a_k \le \frac{n_0}{n}\, c + \frac{n - n_0}{n} \left( a + \frac{\epsilon}{2} \right).$$

The limit of the right-hand side, as $n$ tends to infinity, is $a + (\epsilon/2)$. Therefore, there
exists some $n_1$ such that $b_n \le a + \epsilon$, for every $n \ge n_1$. By a symmetrical argument,
there exists some $n_2$ such that $b_n \ge a - \epsilon$, for every $n \ge n_2$. We have shown that for
every $\epsilon > 0$, there exists some $n_3$ (namely, $n_3 = \max\{n_1, n_2\}$) such that $|b_n - a| \le \epsilon$,
for all $n \ge n_3$. This means that $b_n$ converges to $a$.


Problem 22.* Doubly stochastic matrices. Consider a Markov chain with a 
single recurrent class which is aperiodic, and whose transition probability matrix is 
doubly stochastic, i.e., it has the property that the entries in any column (as well as 
in any row) add to unity, so that 


$$\sum_{i=1}^{m} p_{ij} = 1, \qquad j = 1, \ldots, m.$$


(a) Show that the transition probability matrix of the chain in Example 7.7 is doubly
stochastic.


(b) Show that the steady-state probabilities are

$$\pi_j = \frac{1}{m}, \qquad j = 1, \ldots, m.$$


(c) Suppose that the recurrent class of the chain is instead periodic. Show that
$\pi_1 = \cdots = \pi_m = 1/m$ is the unique solution to the balance and normalization
equations. Discuss your answer in the context of Example 7.7 for the case where
$m$ is even.


Solution. (a) Indeed the rows and the columns of the transition probability matrix in 
this example all add to 1. 


(b) We have

$$\sum_{i=1}^{m} \frac{1}{m}\, p_{ij} = \frac{1}{m} \sum_{i=1}^{m} p_{ij} = \frac{1}{m}.$$

Thus, the given probabilities $\pi_j = 1/m$ satisfy the balance equations and must therefore
be the steady-state probabilities.


(c) Let $(\pi_1, \ldots, \pi_m)$ be a solution to the balance and normalization equations. Consider
a particular $j$ such that $\pi_j \ge \pi_i$ for all $i$, and let $q = \pi_j$. The balance equation for
state $j$ yields

$$q = \pi_j = \sum_{i=1}^{m} \pi_i p_{ij} \le q \sum_{i=1}^{m} p_{ij} = q,$$

where the last step follows because the transition probability matrix is doubly stochastic.
It follows that the above inequality is actually an equality and

$$\sum_{i=1}^{m} \pi_i p_{ij} = \sum_{i=1}^{m} q\, p_{ij}.$$

Since $\pi_i \le q$ for all $i$, we must have $\pi_i p_{ij} = q p_{ij}$ for every $i$. Thus, $\pi_i = q$ for every
state $i$ from which a transition to $j$ is possible. By repeating this argument, we see that
$\pi_i = q$ for every state $i$ such that there is a positive probability path from $i$ to $j$. Since
all states are recurrent and belong to the same class, all states $i$ have this property,
and therefore $\pi_i$ is the same for all $i$. Since the $\pi_i$ add to 1, we obtain $\pi_i = 1/m$ for
all $i$.

If $m$ is even in Example 7.7, the chain is periodic with period 2. Despite this
fact, the result we have just established shows that $\pi_j = 1/m$ is the unique solution to
the balance and normalization equations.
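
The claim of Problem 22 can be spot-checked on a periodic example. The sketch below uses
a hypothetical walk on a circle of $m$ states (it is not claimed to be the exact chain of
Example 7.7), confirms that the columns of its transition matrix add to 1, and verifies
that the uniform PMF satisfies the balance equations even though the chain has period 2
when $m$ is even.

```python
def cycle_walk(m, p):
    """Hypothetical walk on a circle of m states: +1 w.p. p, -1 w.p. 1 - p."""
    P = [[0.0] * m for _ in range(m)]
    for i in range(m):
        P[i][(i + 1) % m] += p
        P[i][(i - 1) % m] += 1 - p
    return P

m, p = 4, 0.7   # m even, so every return to a state takes an even number of steps
P = cycle_walk(m, p)

col_sums = [sum(P[i][j] for i in range(m)) for j in range(m)]
uniform = [1.0 / m] * m
# Balance check for the uniform candidate: sum_i pi_i p_ij versus pi_j.
balance_gap = max(abs(sum(uniform[i] * P[i][j] for i in range(m)) - uniform[j])
                  for j in range(m))
print(col_sums, balance_gap)
```

The balance equations hold here, but the occupancy probabilities of this chain do not
converge, so $1/m$ should be read as a long-term frequency and not as a limit of $r_{ij}(n)$.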


Problem 23.* Queueing. Consider the queueing Example 7.9, but assume that the
probabilities of a packet arrival and a packet transmission depend on the state of the
queue. In particular, in each period where there are $i$ packets in the node, one of the
following occurs:


(i) one new packet arrives; this happens with a given probability $b_i$. We assume that
$b_i > 0$ for $i < m$, and $b_m = 0$.


(ii) one existing packet completes transmission; this happens with a given probability
$d_i > 0$ if $i \ge 1$, and with probability 0 otherwise;


(iii) no new packet arrives and no existing packet completes transmission; this happens
with probability $1 - b_i - d_i$ if $i \ge 1$, and with probability $1 - b_i$ if $i = 0$.

Calculate the steady-state probabilities of the corresponding Markov chain.


Solution. We introduce a Markov chain where the states are $0, 1, \ldots, m$, and correspond
to the number of packets currently stored at the node. The transition probability graph
is given in Fig. 7.23.


Figure 7.23: Transition probability graph for Problem 23.


Similar to Example 7.9, we write down the local balance equations, which take
the form

$$\pi_i b_i = \pi_{i+1} d_{i+1}, \qquad i = 0, 1, \ldots, m - 1.$$

Thus we have $\pi_{i+1} = \rho_i \pi_i$, where

$$\rho_i = \frac{b_i}{d_{i+1}}.$$

Hence $\pi_i = (\rho_0 \cdots \rho_{i-1})\pi_0$ for $i = 1, \ldots, m$. By using the normalization equation
$1 = \pi_0 + \pi_1 + \cdots + \pi_m$, we obtain

$$1 = \pi_0 (1 + \rho_0 + \rho_0\rho_1 + \cdots + \rho_0 \cdots \rho_{m-1}),$$

from which

$$\pi_0 = \frac{1}{1 + \rho_0 + \rho_0\rho_1 + \cdots + \rho_0 \cdots \rho_{m-1}}.$$

The remaining steady-state probabilities are

$$\pi_i = \frac{\rho_0 \cdots \rho_{i-1}}{1 + \rho_0 + \rho_0\rho_1 + \cdots + \rho_0 \cdots \rho_{m-1}}, \qquad i = 1, \ldots, m.$$
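
The product-form solution of Problem 23 translates directly into a few lines of code. In
the sketch below, the function name `queue_steady_state`, the list layout of the inputs,
and the numerical values are illustrative assumptions. The list `b` holds
$b_0, \ldots, b_{m-1}$ and the list `d` holds $d_1, \ldots, d_m$.

```python
def queue_steady_state(b, d):
    """Steady-state PMF from pi_{i+1} = rho_i pi_i with rho_i = b_i / d_{i+1}.

    b[i] is the arrival probability b_i for i = 0..m-1;
    d[i] is the departure probability d_{i+1} for i = 0..m-1.
    """
    weights = [1.0]                              # unnormalized pi_0
    for b_i, d_next in zip(b, d):
        weights.append(weights[-1] * b_i / d_next)   # rho_0 * ... * rho_i
    total = sum(weights)                         # 1 + rho_0 + rho_0 rho_1 + ...
    return [w / total for w in weights]

b = [0.3, 0.2, 0.1]   # hypothetical b_0, b_1, b_2 (so m = 3 and b_3 = 0)
d = [0.4, 0.4, 0.5]   # hypothetical d_1, d_2, d_3
pi = queue_steady_state(b, d)

# Local balance check: pi_i b_i should equal pi_{i+1} d_{i+1}.
local_gap = max(abs(pi[i] * b[i] - pi[i + 1] * d[i]) for i in range(len(b)))
print(pi, local_gap)
```

Working with unnormalized weights and dividing once at the end avoids forming the
denominator separately for each state.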


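Problem 24.* Dependence of the balance equations. Show that if we add the
first $m - 1$ balance equations $\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}$, for $j = 1, \ldots, m - 1$, we obtain the
last equation $\pi_m = \sum_{k=1}^{m} \pi_k p_{km}$.


Solution. By adding the first $m - 1$ balance equations, we obtain

$$\begin{aligned}
\sum_{j=1}^{m-1} \pi_j &= \sum_{j=1}^{m-1} \sum_{k=1}^{m} \pi_k p_{kj} \\
&= \sum_{k=1}^{m} \pi_k \sum_{j=1}^{m-1} p_{kj} \\
&= \sum_{k=1}^{m} \pi_k (1 - p_{km}) \\
&= \pi_m + \sum_{k=1}^{m-1} \pi_k - \sum_{k=1}^{m} \pi_k p_{km}.
\end{aligned}$$

This equation is equivalent to the last balance equation $\pi_m = \sum_{k=1}^{m} \pi_k p_{km}$.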


Problem 25.* Local balance equations. We are given a Markov chain that has
a single recurrent class which is aperiodic. Suppose that we have found a solution
$\pi_1, \ldots, \pi_m$ to the following system of local balance and normalization equations:

$$\pi_i p_{ij} = \pi_j p_{ji}, \qquad i, j = 1, \ldots, m,$$

$$\sum_{i=1}^{m} \pi_i = 1.$$


(a) Show that the $\pi_j$ are the steady-state probabilities.


(b) What is the interpretation of the equations $\pi_i p_{ij} = \pi_j p_{ji}$ in terms of expected
long-term frequencies of transitions between $i$ and $j$?


(c) Construct an example where the local balance equations are not satisfied by the
steady-state probabilities.


Solution. (a) By adding the local balance equations $\pi_i p_{ij} = \pi_j p_{ji}$ over $i$, we obtain

$$\sum_{i=1}^{m} \pi_i p_{ij} = \sum_{i=1}^{m} \pi_j p_{ji} = \pi_j,$$

so the $\pi_j$ also satisfy the balance equations. Therefore, they are equal to the
steady-state probabilities.

(b) We know that $\pi_i p_{ij}$ can be interpreted as the expected long-term frequency of
transitions from $i$ to $j$, so the local balance equations imply that the expected
long-term frequency of any transition is equal to the expected long-term frequency of the
reverse transition. (This property is also known as time reversibility of the chain.)


(c) We need a minimum of three states for such an example. Let the states be 1, 2, 3,
and let $p_{12} > 0$, $p_{13} > 0$, $p_{21} > 0$, $p_{32} > 0$, with all other transition probabilities being
0. The chain has a single recurrent aperiodic class. The local balance equations do
not hold because the expected frequency of transitions from 1 to 3 is positive, but the
expected frequency of reverse transitions is 0.
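
The counterexample of part (c) can be made concrete by picking numbers. In the sketch
below, the values $p_{12} = p_{13} = 1/2$ and $p_{21} = p_{32} = 1$ are one hypothetical choice
consistent with the construction. The steady-state probabilities are approximated by
iterating $\pi \leftarrow \pi P$, and the transition frequencies between states 1 and 3 are
compared.

```python
P = [[0.0, 0.5, 0.5],
     [1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]   # states 1, 2, 3 stored at indices 0, 1, 2
m = len(P)

pi = [1.0 / m] * m
for _ in range(500):
    # One application of the balance map pi <- pi P.
    pi = [sum(pi[i] * P[i][j] for i in range(m)) for j in range(m)]

flow_1_to_3 = pi[0] * P[0][2]   # expected long-term frequency of transitions 1 -> 3
flow_3_to_1 = pi[2] * P[2][0]   # expected long-term frequency of transitions 3 -> 1
print(pi, flow_1_to_3, flow_3_to_1)
```

The steady-state probabilities come out as (0.4, 0.4, 0.2), so transitions from 1 to 3
occur with frequency 0.2 while transitions from 3 to 1 never occur.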


Problem 26.* Sampled Markov chains. Consider a Markov chain $X_n$ with
transition probabilities $p_{ij}$, and let $r_{ij}(n)$ be the $n$-step transition probabilities.


(a) Show that for all $n \ge 1$ and $l \ge 1$, we have

$$r_{ij}(n + l) = \sum_{k=1}^{m} r_{ik}(n)\, r_{kj}(l).$$


(b) Suppose that there is a single recurrent class, which is aperiodic. We sample the
Markov chain every $l$ transitions, thus generating a process $Y_n$, where $Y_n = X_{ln}$.
Show that the sampled process can be modeled by a Markov chain with a single
aperiodic recurrent class and transition probabilities $r_{ij}(l)$.


(c) Show that the Markov chain of part (b) has the same steady-state probabilities
as the original process.

Solution. (a) We condition on $X_n$ and use the total probability theorem. We have

$$\begin{aligned}
r_{ij}(n + l) &= P(X_{n+l} = j \mid X_0 = i) \\
&= \sum_{k=1}^{m} P(X_n = k \mid X_0 = i)\, P(X_{n+l} = j \mid X_n = k, X_0 = i) \\
&= \sum_{k=1}^{m} P(X_n = k \mid X_0 = i)\, P(X_{n+l} = j \mid X_n = k) \\
&= \sum_{k=1}^{m} r_{ik}(n)\, r_{kj}(l),
\end{aligned}$$

where in the third equality we used the Markov property.


(b) Since Xn is Markov, once we condition on Xin, the past of the process (the states Xx 
for k < In) becomes independent of the future (the states Xx for k >In). This implies 
that given Yn, the past (the states Y; for k < n) is independent of the future (the states 
Yk for k > n). Thus, Yn has the Markov property. Because of our assumptions on Xn, 
there is a time 7 such that 


P(X, = j| Xo = i) > 0, 


for every n > 7, every state i, and every state 7 in the single recurrent class R of the 
process X4. This implies that 


P(Y, = j | Yo = i) > 0, 
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for every n 2 m, every i, and every j € R. Therefore, the process Y, has a single 
recurrent class, which is aperiodic. 


(c) The $n$-step transition probabilities $r_{ij}(n)$ of the process $X_n$ converge to the
steady-state probabilities $\pi_j$. The $n$-step transition probabilities of the process $Y_n$ are of the
form $r_{ij}(ln)$, and therefore also converge to the same limits $\pi_j$. This establishes that
the $\pi_j$ are the steady-state probabilities of the process $Y_n$.
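The identity of part (a) and the conclusion of part (c) can be checked numerically. The sketch below is only an illustration: the three-state transition matrix `P` and the sampling interval `l = 3` are made-up assumptions (not taken from the text), and steady-state probabilities are approximated by a row of a high power of the transition matrix.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    m = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(m)]
            for i in range(m)]


def mat_pow(p, n):
    """Return the n-step transition matrix r(n) = p^n, for n >= 1."""
    result = p
    for _ in range(n - 1):
        result = mat_mul(result, p)
    return result


# Hypothetical chain with a single recurrent class that is aperiodic.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]

l = 3
P_sampled = mat_pow(P, l)  # transition probabilities r_ij(l) of Y_n = X_{ln}

# Part (a): r(n + l) = r(n) r(l), checked here for n = 4.
lhs = mat_pow(P, 4 + l)
rhs = mat_mul(mat_pow(P, 4), P_sampled)
chapman_gap = max(abs(lhs[i][j] - rhs[i][j])
                  for i in range(3) for j in range(3))

# Part (c): rows of high powers approximate the steady-state probabilities.
pi_original = mat_pow(P, 200)[0]
pi_sampled = mat_pow(P_sampled, 200)[0]
steady_gap = max(abs(x - y) for x, y in zip(pi_original, pi_sampled))
```

Both `chapman_gap` and `steady_gap` should be zero up to floating-point rounding.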


Problem 27.* Given a Markov chain $X_n$ with a single recurrent class which is
aperiodic, consider the Markov chain whose state at time $n$ is $(X_{n-1}, X_n)$. Thus, the state
in the new chain can be associated with the last transition in the original chain.


(a) Show that the steady-state probabilities of the new chain are

$$\eta_{ij} = \pi_i p_{ij},$$

where the $\pi_i$ are the steady-state probabilities of the original chain.


(b) Generalize part (a) to the case of the Markov chain $(X_{n-k}, X_{n-k+1}, \ldots, X_n)$,
whose state can be associated with the last $k$ transitions of the original chain.


Solution. (a) For every state $(i, j)$ of the new Markov chain, we have

$$P\big((X_{n-1}, X_n) = (i, j)\big) = P(X_{n-1} = i)\, P(X_n = j \mid X_{n-1} = i) = P(X_{n-1} = i)\, p_{ij}.$$

Since the Markov chain $X_n$ has a single recurrent class which is aperiodic, $P(X_{n-1} = i)$
converges to the steady-state probability $\pi_i$, for every $i$. It follows that
$P\big((X_{n-1}, X_n) = (i, j)\big)$ converges to $\pi_i p_{ij}$, which is therefore the steady-state
probability of $(i, j)$.


(b) Using the multiplication rule, we have

$$P\big((X_{n-k}, \ldots, X_n) = (i_0, \ldots, i_k)\big) = P(X_{n-k} = i_0)\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k}.$$

Therefore, by an argument similar to the one in part (a), the steady-state probability
of state $(i_0, \ldots, i_k)$ is equal to $\pi_{i_0}\, p_{i_0 i_1} \cdots p_{i_{k-1} i_k}$.
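As a sanity check on part (a), one can verify that the candidate probabilities $\pi_i p_{ij}$ satisfy the balance and normalization equations of the pair chain. The sketch below assumes a made-up three-state matrix `P` (not from the text) and obtains the $\pi_i$ by repeatedly applying the balance equations; the helper name `pair_transition` is hypothetical.

```python
# Hypothetical chain with a single recurrent class that is aperiodic.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]
m = len(P)

# Steady-state probabilities of the original chain, by iterating pi <- pi P.
pi = [1.0 / m] * m
for _ in range(500):
    pi = [sum(pi[k] * P[k][i] for k in range(m)) for i in range(m)]

# Candidate steady-state probabilities of the pair chain (X_{n-1}, X_n).
eta = {(i, j): pi[i] * P[i][j] for i in range(m) for j in range(m)}


def pair_transition(a, b):
    """Probability that the pair chain moves from a = (h, i) to b = (i2, j)."""
    (_, i), (i2, j) = a, b
    return P[i][j] if i == i2 else 0.0


# Balance equations of the pair chain: eta_b = sum_a eta_a * P(a -> b).
balance_gap = max(
    abs(eta[b] - sum(eta[a] * pair_transition(a, b) for a in eta))
    for b in eta
)
total = sum(eta.values())
```

Pairs $(i, j)$ with $p_{ij} = 0$ are never visited, and the formula correctly assigns them probability zero.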


SECTION 7.4. Absorption Probabilities and Expected Time to Absorption


Problem 28. There are $m$ classes offered by a particular department, and each year,
the students rank each class from 1 to $m$, in order of difficulty, with rank $m$ being the
highest. Unfortunately, the ranking is completely arbitrary. In fact, any given class is
equally likely to receive any given rank in a given year (two classes may not receive
the same rank). A certain professor chooses to remember only the highest ranking his
class has ever gotten.


(a) Find the transition probabilities of the Markov chain that models the ranking 
that the professor remembers. 


(b) Find the recurrent and the transient states. 


(c) Find the expected number of years for the professor to achieve the highest ranking 
given that in the first year he achieved the ith ranking. 
Figure 7.24: Transition probability graph for Problem 29.


Problem 29. Consider the Markov chain specified in Fig. 7.24. The steady-state
probabilities are known to be:

$$\pi_1 = \frac{6}{31}, \qquad \pi_2 = \frac{9}{31}, \qquad \pi_3 = \frac{6}{31}, \qquad \pi_4 = \frac{10}{31}.$$


Assume that the process is in state 1 just before the first transition. 


(a) What is the probability that the process will be in state 1 just after the sixth 
transition? 


(b) Determine the expected value and variance of the number of transitions up to 
and including the next transition during which the process returns to state 1. 


(c) What is (approximately) the probability that the state of the system resulting 
from transition 1000 is neither the same as the state resulting from transition 
999 nor the same as the state resulting from transition 1001? 


Problem 30. Consider the Markov chain specified in Fig. 7.25. 





Figure 7.25: Transition probability graph for Problem 30. 


(a) Identify the transient and recurrent states. Also, determine the recurrent classes
and indicate which ones, if any, are periodic.


(b) Do there exist steady-state probabilities given that the process starts in state 1? 
If so, what are they? 


(c) Do there exist steady-state probabilities given that the process starts in state 6? 
If so, what are they? 
(d) Assume that the process starts in state 1 but we begin observing it after it reaches
steady-state. 


(i) Find the probability that the state increases by one during the first transi- 
tion we observe. 


(ii) Find the conditional probability that the process was in state 2 when we 
started observing it, given that the state increased by one during the first 
transition that we observed. 


(iii) Find the probability that the state increased by one during the first change 
of state that we observed. 


(e) Assume that the process starts in state 4. 


(i) For each recurrent class, determine the probability that we eventually reach 
that class. 


(ii) What is the expected number of transitions up to and including the tran- 
sition at which we reach a recurrent state for the first time? 


Problem 31.* Absorption probabilities. Consider a Markov chain where each
state is either transient or absorbing. Fix an absorbing state $s$. Show that the
probabilities $a_i$ of eventually reaching $s$ starting from a state $i$ are the unique solution to
the equations

$$
\begin{aligned}
a_s &= 1, \\
a_i &= 0, && \text{for all absorbing } i \neq s, \\
a_i &= \sum_{j=1}^{m} p_{ij} a_j, && \text{for all transient } i.
\end{aligned}
$$

Hint: If there are two solutions, find a system of equations that is satisfied by their
difference, and look for its solutions.


Solution. The fact that the $a_i$ satisfy these equations was established in the text, using
the total probability theorem. To show uniqueness, let $\bar{a}_i$ be another solution, and let
$\delta_i = \bar{a}_i - a_i$. Denoting by $A$ the set of absorbing states and using the fact $\delta_j = 0$ for
all $j \in A$, we have

$$\delta_i = \sum_{j=1}^{m} p_{ij} \delta_j = \sum_{j \notin A} p_{ij} \delta_j, \qquad \text{for all transient } i.$$

Applying this relation $m$ successive times, we obtain

$$\delta_i = \sum_{j_1 \notin A} p_{i j_1} \sum_{j_2 \notin A} p_{j_1 j_2} \cdots \sum_{j_m \notin A} p_{j_{m-1} j_m} \cdot \delta_{j_m}.$$

Hence

$$
\begin{aligned}
|\delta_i| &\leq \sum_{j_1 \notin A} p_{i j_1} \sum_{j_2 \notin A} p_{j_1 j_2} \cdots \sum_{j_m \notin A} p_{j_{m-1} j_m} \cdot |\delta_{j_m}| \\
&\leq P(X_1 \notin A, \ldots, X_m \notin A \mid X_0 = i) \cdot \max_{j \notin A} |\delta_j|.
\end{aligned}
$$

The above relation holds for all transient $i$, so we obtain

$$\max_{j \notin A} |\delta_j| \leq \beta \cdot \max_{j \notin A} |\delta_j|,$$

where

$$\beta = \max_{i \notin A} P(X_1 \notin A, \ldots, X_m \notin A \mid X_0 = i).$$

Note that $\beta < 1$, because there is positive probability that $X_m$ is absorbing, regardless
of the initial state. It follows that $\max_{j \notin A} |\delta_j| = 0$, or $a_i = \bar{a}_i$ for all $i$ that are not
absorbing. We also have $a_j = \bar{a}_j$ for all absorbing $j$, so $a_i = \bar{a}_i$ for all $i$.
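Because the solution is unique, the absorption probabilities can be obtained by solving the linear system directly. The following sketch does this exactly with rational arithmetic; the gambler's-ruin chain (states 0 to 4, up with probability 1/3, down with probability 2/3) and the helper names `solve_linear` and `absorption_probabilities` are illustrative assumptions, not part of the text.

```python
from fractions import Fraction


def solve_linear(a, b):
    """Solve a x = b exactly by Gauss-Jordan elimination on Fractions."""
    n = len(a)
    aug = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Find a row with a nonzero entry in this column and move it up.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] for r in range(n)]


def absorption_probabilities(p, s):
    """Probabilities a_i of eventually reaching absorbing state s from each i."""
    m = len(p)
    a, b = [], []
    for i in range(m):
        row = [Fraction(0)] * m
        row[i] = Fraction(1)
        if p[i][i] == 1:
            # Absorbing state: a_s = 1, and a_i = 0 for absorbing i != s.
            b.append(Fraction(1 if i == s else 0))
        else:
            # Transient state: a_i - sum_j p_ij a_j = 0.
            row = [row[j] - p[i][j] for j in range(m)]
            b.append(Fraction(0))
        a.append(row)
    return solve_linear(a, b)


# Hypothetical gambler's-ruin chain on states 0..4 (0 and 4 absorbing).
up, down = Fraction(1, 3), Fraction(2, 3)
P = [[Fraction(0)] * 5 for _ in range(5)]
P[0][0] = P[4][4] = Fraction(1)
for i in (1, 2, 3):
    P[i][i + 1], P[i][i - 1] = up, down

a = absorption_probabilities(P, s=4)
```

For this chain the result agrees with the closed form $a_i = (2^i - 1)/15$.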


Problem 32.* Multiple recurrent classes. Consider a Markov chain that has
more than one recurrent class, as well as some transient states. Assume that all the
recurrent classes are aperiodic.


(a) For any transient state $i$, let $a_i(k)$ be the probability that starting from $i$ we
will reach a state in the $k$th recurrent class. Derive a system of equations whose
solution is the set of probabilities $a_i(k)$.


(b) Show that each of the $n$-step transition probabilities $r_{ij}(n)$ converges to a limit,
and discuss how these limits can be calculated.


Solution. (a) We introduce a new Markov chain that has only transient and absorbing
states. The transient states correspond to the transient states of the original, while
the absorbing states correspond to the recurrent classes of the original. The transition
probabilities $\hat{p}_{ij}$ of the new chain are as follows: if $i$ and $j$ are transient, $\hat{p}_{ij} = p_{ij}$; if $i$ is
a transient state and $k$ corresponds to a recurrent class, $\hat{p}_{ik}$ is the sum of the transition
probabilities from $i$ to states in the recurrent class in the original Markov chain.

The desired probabilities $a_i(k)$ are the absorption probabilities in the new Markov
chain and are given by the corresponding formulas:

$$a_i(k) = \hat{p}_{ik} + \sum_{j:\ \text{transient}} p_{ij}\, a_j(k), \qquad \text{for all transient } i.$$


(b) If $i$ and $j$ are recurrent but belong to different classes, $r_{ij}(n)$ is always 0. If $i$
and $j$ are recurrent but belong to the same class, $r_{ij}(n)$ converges to the steady-state
probability of $j$ in a chain consisting of only this particular recurrent class. If $j$ is
transient, $r_{ij}(n)$ converges to 0. Finally, if $i$ is transient and $j$ is recurrent, then $r_{ij}(n)$
converges to the product of two probabilities: (1) the probability that starting from $i$
we will reach a state in the recurrent class of $j$, and (2) the steady-state probability of
$j$ conditioned on the initial state being in the class of $j$.
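The product rule in part (b) can be illustrated on a small example. In the sketch below, the five-state chain (transient states 0 and 1, recurrent classes {2} and {3, 4}) is a made-up assumption, and both the reach probabilities $a_i(k)$ and the within-class steady-state probabilities are computed by simple fixed-point iteration rather than by any method prescribed in the text.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


# Hypothetical chain: states 0, 1 transient; class "A" = {2}; class "B" = {3, 4}.
P = [[0.2, 0.3, 0.1, 0.4, 0.0],
     [0.5, 0.0, 0.5, 0.0, 0.0],
     [0.0, 0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.5, 0.5],
     [0.0, 0.0, 0.0, 0.8, 0.2]]
transient = [0, 1]
classes = {"A": [2], "B": [3, 4]}


def reach_probabilities(members, sweeps=500):
    """Iterate a_i(k) = p_hat_ik + sum_{j transient} p_ij a_j(k)."""
    p_hat = {i: sum(P[i][j] for j in members) for i in transient}
    a = {i: 0.0 for i in transient}
    for _ in range(sweeps):
        a = {i: p_hat[i] + sum(P[i][j] * a[j] for j in transient)
             for i in transient}
    return a


def class_steady_state(members, sweeps=500):
    """Steady-state probabilities of the chain restricted to one class."""
    pi = {j: 1.0 / len(members) for j in members}
    for _ in range(sweeps):
        pi = {j: sum(pi[k] * P[k][j] for k in members) for j in members}
    return pi


# Predicted limits of r_ij(n) for transient i and recurrent j.
predicted = {}
for members in classes.values():
    a, pi = reach_probabilities(members), class_steady_state(members)
    for i in transient:
        for j in members:
            predicted[(i, j)] = a[i] * pi[j]

# Compare with a high power of P.
r = P
for _ in range(300):
    r = mat_mul(r, P)
gap = max(abs(r[i][j] - value) for (i, j), value in predicted.items())
```

For this example, hand calculation gives $a_0(B) = 8/13$ and a steady-state probability of $8/13$ for state 3 within its class, so the predicted limit of $r_{03}(n)$ is $64/169$.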


Problem 33.* Mean first passage times. Consider a Markov chain with a single
recurrent class, and let $s$ be a fixed recurrent state. Show that the system of equations

$$t_s = 0, \qquad t_i = 1 + \sum_{j=1}^{m} p_{ij} t_j, \quad \text{for all } i \neq s,$$

satisfied by the mean first passage times, has a unique solution. Hint: If there are two
solutions, find a system of equations that is satisfied by their difference, and look for
its solutions.
Solution. Let $t_i$ be the mean first passage times. These satisfy the given system of
equations. To show uniqueness, let $\bar{t}_i$ be another solution. Then we have for all $i \neq s$

$$t_i = 1 + \sum_{j \neq s} p_{ij} t_j, \qquad \bar{t}_i = 1 + \sum_{j \neq s} p_{ij} \bar{t}_j,$$

and by subtraction, we obtain

$$\delta_i = \sum_{j \neq s} p_{ij} \delta_j,$$

where $\delta_i = \bar{t}_i - t_i$. By applying this relation $m$ successive times, it follows that

$$\delta_i = \sum_{j_1 \neq s} p_{i j_1} \sum_{j_2 \neq s} p_{j_1 j_2} \cdots \sum_{j_m \neq s} p_{j_{m-1} j_m} \cdot \delta_{j_m}.$$

Hence, we have for all $i \neq s$

$$
\begin{aligned}
|\delta_i| &\leq \sum_{j_1 \neq s} p_{i j_1} \sum_{j_2 \neq s} p_{j_1 j_2} \cdots \sum_{j_m \neq s} p_{j_{m-1} j_m} \cdot \max_j |\delta_j| \\
&= P(X_1 \neq s, \ldots, X_m \neq s \mid X_0 = i) \cdot \max_j |\delta_j|.
\end{aligned}
$$

On the other hand, we have $P(X_1 \neq s, \ldots, X_m \neq s \mid X_0 = i) < 1$. This is because
starting from any state there is positive probability that $s$ is reached in $m$ steps. It
follows that all the $\delta_i$ must be equal to zero.
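The contraction property used in the uniqueness argument also suggests a simple way to compute the mean first passage times: repeatedly substitute the current guess into the right-hand side of the equations. This fixed-point iteration is an illustrative choice, not a method from the text, and the three-state matrix `P` and the function name `mean_first_passage_times` are made up for the example.

```python
def mean_first_passage_times(p, s, sweeps=2000):
    """Iterate t_i = 1 + sum_{j != s} p_ij t_j, with t_s = 0 held fixed."""
    m = len(p)
    t = [0.0] * m
    for _ in range(sweeps):
        t = [0.0 if i == s else
             1.0 + sum(p[i][j] * t[j] for j in range(m) if j != s)
             for i in range(m)]
    return t


# Hypothetical chain with a single recurrent class.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]

t = mean_first_passage_times(P, s=0)

# Largest violation of the equations t_i = 1 + sum_j p_ij t_j over i != s.
residual = max(abs(t[i] - 1.0 - sum(P[i][j] * t[j] for j in range(3)))
               for i in range(3) if i != 0)
```

Solving the two equations by hand for this chain gives $t_1 = 55/6$ and $t_2 = 65/6$, which the iteration reproduces up to rounding.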


Problem 34.* Balance equations and mean recurrence times. Consider a
Markov chain with a single recurrent class, and let $s$ be a fixed recurrent state. For
any state $i$, let

$$\rho_i = E[\text{Number of visits to } i \text{ between two successive visits to } s],$$

where by convention, $\rho_s = 1$.


(a) Show that for all $i$, we have

$$\rho_i = \sum_{k=1}^{m} \rho_k p_{ki}.$$


(b) Show that the numbers

$$\pi_i = \frac{\rho_i}{t_s^*}, \qquad i = 1, \ldots, m,$$

sum to 1 and satisfy the balance equations, where $t_s^*$ is the mean recurrence time
of $s$ (the expected number of transitions up to the first return to $s$, starting
from $s$).


(c) Show that if $\pi_1, \ldots, \pi_m$ are nonnegative, satisfy the balance equations, and sum
to 1, then

$$
\pi_i = \begin{cases}
\dfrac{1}{t_i^*}, & \text{if } i \text{ is recurrent}, \\[1ex]
0, & \text{if } i \text{ is transient},
\end{cases}
$$

where $t_i^*$ is the mean recurrence time of $i$.


(d) Show that the distribution of part (b) is the unique probability distribution that 
satisfies the balance equations. 


Note: This problem not only provides an alternative proof of the existence and
uniqueness of probability distributions that satisfy the balance equations, but also makes an
intuitive connection between steady-state probabilities and mean recurrence times. The
main idea is to break the process into "cycles," with a new cycle starting each time that
the recurrent state $s$ is visited. The steady-state probability of $s$ can be interpreted as
the long-term expected frequency of visits to state $s$, which is inversely proportional to
the average time between consecutive visits (the mean recurrence time); cf. part (c).
Furthermore, if a state $i$ is expected to be visited, say, twice as often as some other
state $j$ during a typical cycle, it is plausible that the long-term expected frequency $\pi_i$
of state $i$ will be twice as large as $\pi_j$. Thus, the steady-state probabilities $\pi_i$ should be
proportional to the expected number of visits $\rho_i$ during a cycle; cf. part (b).


Solution. (a) Consider the Markov chain $X_n$, initialized with $X_0 = s$. We claim that
for all $i$

$$\rho_i = \sum_{n=1}^{\infty} P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = i).$$

To see this, we first consider the case $i \neq s$, and let $I_n$ be the random variable that takes
the value 1 if $X_1 \neq s, \ldots, X_{n-1} \neq s$, and $X_n = i$, and the value 0 otherwise. Then, the
number of visits to state $i$ before the next visit to state $s$ is equal to $\sum_{n=1}^{\infty} I_n$. Thus,¹

$$\rho_i = E\left[\sum_{n=1}^{\infty} I_n\right] = \sum_{n=1}^{\infty} E[I_n] = \sum_{n=1}^{\infty} P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = i).$$


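¹ The interchange of the infinite summation and the expectation in the preceding
calculation can be justified by the following argument. We have for any $k > 0$,

$$E\left[\sum_{n=1}^{\infty} I_n\right] = E\left[\sum_{n=1}^{k} I_n + \sum_{n=k+1}^{\infty} I_n\right] = \sum_{n=1}^{k} E[I_n] + E\left[\sum_{n=k+1}^{\infty} I_n\right].$$

Let $T$ be the first positive time that state $s$ is visited. Then,

$$E\left[\sum_{n=k+1}^{\infty} I_n\right] = \sum_{t=k+2}^{\infty} P(T = t)\, E\left[\sum_{n=k+1}^{\infty} I_n \,\Big|\, T = t\right] \leq \sum_{t=k+2}^{\infty} t\, P(T = t).$$

Since the mean recurrence time $\sum_{t=1}^{\infty} t\, P(T = t)$ is finite, the limit, as $k \to \infty$, of
$\sum_{t=k+2}^{\infty} t\, P(T = t)$ is equal to zero. We take the limit of both sides of the earlier
equation, as $k \to \infty$, to obtain the desired relation

$$E\left[\sum_{n=1}^{\infty} I_n\right] = \sum_{n=1}^{\infty} E[I_n].$$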
When $i = s$, the events

$$\{X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = s\},$$

for the different values of $n$, form a partition of the sample space, because they
correspond to the different possibilities for the time of the next visit to state $s$. Thus,

$$\sum_{n=1}^{\infty} P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = s) = 1 = \rho_s,$$

which completes the verification of our assertion.
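We next use the total probability theorem to write for $n \geq 2$,

$$P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = i) = \sum_{k \neq s} P(X_1 \neq s, \ldots, X_{n-2} \neq s,\, X_{n-1} = k)\, p_{ki}.$$

We thus obtain

$$
\begin{aligned}
\rho_i &= \sum_{n=1}^{\infty} P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = i) \\
&= p_{si} + \sum_{n=2}^{\infty} P(X_1 \neq s, \ldots, X_{n-1} \neq s,\, X_n = i) \\
&= p_{si} + \sum_{n=2}^{\infty} \sum_{k \neq s} P(X_1 \neq s, \ldots, X_{n-2} \neq s,\, X_{n-1} = k)\, p_{ki} \\
&= p_{si} + \sum_{k \neq s} p_{ki} \sum_{n=2}^{\infty} P(X_1 \neq s, \ldots, X_{n-2} \neq s,\, X_{n-1} = k) \\
&= \rho_s p_{si} + \sum_{k \neq s} \rho_k p_{ki} \\
&= \sum_{k=1}^{m} \rho_k p_{ki},
\end{aligned}
$$

where the next-to-last step uses the convention $\rho_s = 1$ and the claim established above
(with $n - 1$ in place of $n$) to identify the inner sum with $\rho_k$.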


(b) Dividing both sides of the relation established in part (a) by $t_s^*$, we obtain

$$\pi_i = \sum_{k=1}^{m} \pi_k p_{ki},$$

where $\pi_i = \rho_i / t_s^*$. Thus, the $\pi_i$ solve the balance equations. Furthermore, the $\pi_i$ are
nonnegative, and we clearly have $\sum_{i=1}^{m} \rho_i = t_s^*$ or $\sum_{i=1}^{m} \pi_i = 1$. Hence, $(\pi_1, \ldots, \pi_m)$ is
a probability distribution.


(c) Consider a probability distribution $(\pi_1, \ldots, \pi_m)$ that satisfies the balance equations.
Fix a recurrent state $s$, let $t_s^*$ be the mean recurrence time of $s$, and let $t_i$ be the mean
first passage time from a state $i \neq s$ to state $s$. We will show that $\pi_s t_s^* = 1$. Indeed,
we have

$$
\begin{aligned}
t_s^* &= 1 + \sum_{j \neq s} p_{sj} t_j, \\
t_i &= 1 + \sum_{j \neq s} p_{ij} t_j, \qquad \text{for all } i \neq s.
\end{aligned}
$$

Multiplying these equations with $\pi_s$ and $\pi_i$, respectively, and adding, we obtain

$$\pi_s t_s^* + \sum_{i \neq s} \pi_i t_i = 1 + \sum_{i=1}^{m} \pi_i \sum_{j \neq s} p_{ij} t_j.$$

By using the balance equations, the right-hand side is equal to

$$1 + \sum_{i=1}^{m} \pi_i \sum_{j \neq s} p_{ij} t_j = 1 + \sum_{j \neq s} t_j \sum_{i=1}^{m} \pi_i p_{ij} = 1 + \sum_{j \neq s} \pi_j t_j.$$

By combining the last two equations, we obtain $\pi_s t_s^* = 1$.

Since the probability distribution $(\pi_1, \ldots, \pi_m)$ satisfies the balance equations, if
the initial state $X_0$ is chosen according to this distribution, all subsequent states $X_n$
have the same distribution. If we start at a transient state $i$, the probability of being
at that state at time $n$ diminishes to 0 as $n \to \infty$. It follows that we must have $\pi_i = 0$.


(d) Part (b) shows that there exists at least one probability distribution that satisfies 
the balance equations. Part (c) shows that there can be only one such probability 
distribution. 
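The cycle-based construction can be traced numerically. The sketch below accumulates the series from part (a) to obtain the expected visit counts $\rho_i$, forms $\pi_i = \rho_i / t_s^*$, and checks the balance equations. The three-state matrix `P`, the choice $s = 0$, and the truncation of the infinite series after 2000 terms are assumptions made for illustration.

```python
# Hypothetical chain with a single recurrent class; s is the reference state.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]
m, s = len(P), 0

# q[i] = P(X_1 != s, ..., X_{n-1} != s, X_n = i | X_0 = s), starting at n = 1.
q = P[s][:]
rho = [0.0] * m
for _ in range(2000):  # truncated version of the infinite sum over n
    rho = [r + x for r, x in zip(rho, q)]
    # Total probability theorem: condition on X_{n-1} = k, with k != s.
    q = [sum(q[k] * P[k][i] for k in range(m) if k != s) for i in range(m)]

t_star = sum(rho)                  # mean recurrence time of s
pi = [r / t_star for r in rho]     # candidate steady-state probabilities

balance_gap = max(abs(pi[i] - sum(pi[k] * P[k][i] for k in range(m)))
                  for i in range(m))
```

For this chain, solving the balance equations by hand gives $\pi_0 = 12/67$, so the computed `t_star` should be close to $67/12$ and `rho[0]` close to 1, in line with the convention $\rho_s = 1$.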


Problem 35.* The strong law of large numbers for Markov chains. Consider
a finite-state Markov chain in which all states belong to a single recurrent class which
is aperiodic. For a fixed state $s$, let $Y_k$ be the time of the $k$th visit to state $s$. Let also
$V_n$ be the number of visits to state $s$ during the first $n$ transitions.


(a) Show that $Y_k/k$ converges with probability 1 to the mean recurrence time $t_s^*$ of
state $s$.


(b) Show that $V_n/n$ converges with probability 1 to $1/t_s^*$.


(c) Can you relate the limit of $V_n/n$ to the steady-state probability of state $s$?


Solution. (a) Let us fix an initial state $i$, not necessarily the same as $s$. Thus, the
random variables $Y_{k+1} - Y_k$, for $k \geq 1$, correspond to the time between successive visits
to state $s$. Because of the Markov property (the past is independent of the future,
given the present), the process "starts fresh" at each revisit to state $s$ and, therefore,
the random variables $Y_{k+1} - Y_k$ are independent and identically distributed, with mean
equal to the mean recurrence time $t_s^*$. Using the strong law of large numbers, we obtain

$$\lim_{k \to \infty} \frac{Y_k}{k} = \lim_{k \to \infty} \frac{Y_1}{k} + \lim_{k \to \infty} \frac{(Y_2 - Y_1) + \cdots + (Y_k - Y_{k-1})}{k} = 0 + t_s^*,$$

with probability 1.
(b) Let us fix an element of the sample space (a trajectory of the Markov chain). Let
$y_k$ and $v_n$ be the values of the random variables $Y_k$ and $V_n$, respectively. Furthermore,
let us assume that the sequence $y_k/k$ converges to $t_s^*$; according to the result of part
(a), the set of trajectories with this property has probability 1. Let us consider some
$n$ between the time of the $k$th visit to state $s$ and the time just before the next visit to
that state:

$$y_k \leq n < y_{k+1}.$$

For every $n$ in this range, we have $v_n = k$, and also

$$\frac{1}{y_{k+1}} < \frac{1}{n} \leq \frac{1}{y_k},$$

from which we obtain

$$\frac{k}{y_{k+1}} < \frac{v_n}{n} \leq \frac{k}{y_k}.$$

Note that

$$\lim_{k \to \infty} \frac{k}{y_{k+1}} = \lim_{k \to \infty} \frac{k+1}{y_{k+1}} \cdot \lim_{k \to \infty} \frac{k}{k+1} = \lim_{k \to \infty} \frac{k}{y_k} = \frac{1}{t_s^*}.$$

If we now let $n$ go to infinity, the corresponding values of $k$, chosen to satisfy
$y_k \leq n < y_{k+1}$, also go to infinity. Therefore, the sequence $v_n/n$ is between two sequences both
of which converge to $1/t_s^*$, which implies that the sequence $v_n/n$ converges to $1/t_s^*$ as
well. Since this happens for every trajectory in a set of trajectories that has probability
equal to 1, we conclude that $V_n/n$ converges to $1/t_s^*$, with probability 1.


(c) We have $1/t_s^* = \pi_s$, as established in Problem 34. This implies the intuitive result
that $V_n/n$ converges to $\pi_s$, with probability 1. Note: It is tempting to try to establish
the convergence of $V_n/n$ to $\pi_s$ by combining the fact that $V_n/n$ converges [part (b)]
with the fact that $E[V_n]/n$ converges to $\pi_s$ (cf. the long-term expected frequency
interpretation of steady-state probabilities in Section 7.3). However, this line
of reasoning is not valid. This is because it is generally possible for a sequence $Y_n$ of
random variables to converge with probability 1 to a constant, while the expected
values converge to a different constant. An example is the following. Let $X$ be uniformly
distributed in the unit interval $[0, 1]$. Let

$$
Y_n = \begin{cases}
0, & \text{if } X \geq 1/n, \\
n, & \text{if } X < 1/n.
\end{cases}
$$

As long as $X$ is nonzero (which happens with probability 1), the sequence $Y_n$ converges
to zero. On the other hand, it can be seen that

$$E[Y_n] = P(X < 1/n)\, E[Y_n \mid X < 1/n] = \frac{1}{n} \cdot n = 1, \qquad \text{for all } n.$$
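The convergence of $V_n/n$ to $\pi_s$ can be observed in a simulation. The sketch below is illustrative only: the three-state matrix `P`, the initial state, the number of transitions, and the random seed are all assumptions, and the comparison value $\pi_0 = 12/67$ comes from solving the balance equations of this particular chain by hand.

```python
import random

# Hypothetical chain: all states in a single aperiodic recurrent class.
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]


def visit_frequency(p, s, start, n, seed=0):
    """Simulate n transitions and return V_n / n for state s."""
    rng = random.Random(seed)
    states = range(len(p))
    state, visits = start, 0
    for _ in range(n):
        # Draw the next state according to row `state` of the matrix.
        state = rng.choices(states, weights=p[state])[0]
        visits += (state == s)
    return visits / n


frequency = visit_frequency(P, s=0, start=2, n=200_000)
pi_s = 12 / 67  # steady-state probability of state 0 for this chain
```

A single long trajectory suffices here precisely because the convergence holds with probability 1, regardless of the initial state.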


SECTION 7.5. Continuous-Time Markov Chains 


Problem 36. A facility of $m$ identical machines is sharing a single repairperson. The
time to repair a failed machine is exponentially distributed with mean $1/\lambda$. A machine,
once operational, fails after a time that is exponentially distributed with mean $1/\mu$.
All failure and repair times are independent.


(a) Find the steady-state probability that there is no operational machine. 


(b) Find the expected number of operational machines, in steady-state. 


Problem 37. Empty taxis pass by a street corner at a Poisson rate of two per minute 
and pick up a passenger if one is waiting there. Passengers arrive at the street corner 
at a Poisson rate of one per minute and wait for a taxi only if there are fewer than four
persons waiting; otherwise they leave and never return. Penelope arrives at the street
corner at a given time. Find her expected waiting time, given that she joins the queue. 
Assume that the process is in steady-state. 


Problem 38. There are $m$ users who share a computer system. Each user
alternates between "thinking" intervals whose durations are independent exponentially
distributed with parameter $\lambda$, and an "active" mode that starts by submitting a
service request. The server can only serve one request at a time, and will serve a request
completely before serving other requests. The service times of different requests are
independent exponentially distributed random variables with parameter $\mu$, and also
independent of the thinking times of the users. Construct a Markov chain model and
derive the steady-state distribution of the number of pending requests, including the
one presently served, if any.


Problem 39.* Consider a continuous-time Markov chain in which the transition rates
$\nu_i$ are the same for all $i$. Assume that the process has a single recurrent class.


(a) Explain why the sequence $Y_n$ of transition times forms a Poisson process.


(b) Show that the steady-state probabilities of the Markov chain $X(t)$ are the same
as the steady-state probabilities of the embedded Markov chain $X_n$.


Solution. (a) Denote by $\nu$ the common value of the transition rates $\nu_i$. The times
between successive transitions $Y_n$ are independent and exponentially distributed with
parameter $\nu$. Therefore, the $Y_n$ can be associated with the arrival times of a Poisson process
with rate $\nu$.


(b) The balance and normalization equations for the continuous-time chain are

$$
\begin{aligned}
\pi_j \sum_{k \neq j} q_{jk} &= \sum_{k \neq j} \pi_k q_{kj}, \qquad j = 1, \ldots, m, \\
1 &= \sum_{k=1}^{m} \pi_k.
\end{aligned}
$$

By using the relation $q_{jk} = \nu p_{jk}$, and by canceling the common factor $\nu$, these equations
are written as

$$
\begin{aligned}
\pi_j \sum_{k \neq j} p_{jk} &= \sum_{k \neq j} \pi_k p_{kj}, \qquad j = 1, \ldots, m, \\
1 &= \sum_{k=1}^{m} \pi_k.
\end{aligned}
$$

We have $\sum_{k \neq j} p_{jk} = 1 - p_{jj}$, so the first of these two equations is written as

$$\pi_j (1 - p_{jj}) = \sum_{k \neq j} \pi_k p_{kj},$$

or

$$\pi_j = \sum_{k=1}^{m} \pi_k p_{kj}, \qquad j = 1, \ldots, m.$$

These are the balance equations for the embedded Markov chain, which have a unique
solution since the embedded Markov chain has a single recurrent class, which is
aperiodic. Hence the $\pi_j$ are the steady-state probabilities for the embedded Markov chain.
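The equivalence in part (b) is easy to confirm on a small example. In the sketch below, the common rate `nu = 2.0` and the embedded transition matrix `P` are made-up assumptions; the steady-state probabilities of the embedded chain are computed by iterating its balance equations and then substituted into the continuous-time balance equations.

```python
# Hypothetical embedded chain and common transition rate.
nu = 2.0
P = [[0.5, 0.5, 0.0],
     [0.2, 0.3, 0.5],
     [0.0, 0.6, 0.4]]
m = len(P)

# Transition rates q_jk = nu * p_jk for k != j.
Q = [[nu * P[j][k] if k != j else 0.0 for k in range(m)] for j in range(m)]

# Steady-state probabilities of the embedded chain, by iterating pi <- pi P.
pi = [1.0 / m] * m
for _ in range(500):
    pi = [sum(pi[k] * P[k][j] for k in range(m)) for j in range(m)]

# Continuous-time balance: pi_j * sum_{k != j} q_jk = sum_{k != j} pi_k q_kj.
outflow = [pi[j] * sum(Q[j][k] for k in range(m) if k != j) for j in range(m)]
inflow = [sum(pi[k] * Q[k][j] for k in range(m) if k != j) for j in range(m)]
balance_gap = max(abs(o - i) for o, i in zip(outflow, inflow))
```

Changing `nu` rescales both sides of each balance equation by the same factor, which is why the steady-state probabilities do not depend on it.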


Bayesian Statistical Inference 


Contents 


8.1. Bayesian Inference and the Posterior Distribution


Statistical inference is the process of extracting information about an unknown 
variable or an unknown model from available data. In this chapter and the next 
one we aim to: 


(a) Develop an appreciation of the main two approaches (Bayesian and classi- 
cal), their differences, and similarities. 


(b) Present the main categories of inference problems (parameter estimation, 
hypothesis testing, and significance testing). 


(c) Discuss the most important methodologies (maximum a posteriori proba- 
bility rule, least mean squares estimation, maximum likelihood, regression, 
likelihood ratio tests, etc.). 


(d) Illustrate the theory with some concrete examples.


Probability versus Statistics


Statistical inference differs from probability theory in some fundamental ways. 
Probability is a self-contained mathematical subject, based on the axioms in- 
troduced in Chapter 1. In probabilistic reasoning, we assume a fully specified 
probabilistic model that obeys these axioms. We then use mathematical meth- 
ods to quantify the consequences of this model or answer various questions of 
interest. In particular, every unambiguous question has a unique correct answer, 
even if this answer is sometimes hard to find. The model is taken for granted 
and, in principle, it need not bear any resemblance to reality (although for the 
model to be useful, this would better be the case). 

Statistics is different, and it involves an element of art. For any particular 
problem, there may be several reasonable methods, yielding different answers. 
In general, there is no principled way for selecting the “best” method, unless one 
makes several assumptions and imposes additional constraints on the inference 
problem. For instance, given the history of stock market returns over the last 
five years, there is no single “best” method for estimating next year’s returns. 

We can narrow down the search for the “right” method by requiring cer- 
tain desirable properties (e.g., that the method make a correct inference when 
the number of available data is unlimited). The choice of one method over an- 
other usually hinges on several factors: performance guarantees, past experience, 
common sense, as well as the consensus of the statistics community on the ap- 
plicability of a particular method on a particular problem type. We will aim 
to introduce the reader to some of the most popular methods/choices, and the 
main approaches for their analysis and comparison. 


Bayesian versus Classical Statistics 


Within the field of statistics there are two prominent schools of thought, with op- 
posing views: the Bayesian and the classical (also called frequentist). Their 
fundamental difference relates to the nature of the unknown models or variables. 
In the Bayesian view, they are treated as random variables with known
distributions. In the classical view, they are treated as deterministic quantities that
happen to be unknown.

The Bayesian approach essentially tries to move the field of statistics back 
to the realm of probability theory, where every question has a unique answer. In 
particular, when trying to infer the nature of an unknown model, it views the 
model as chosen randomly from a given model class. This is done by introducing 
a random variable $\Theta$ that characterizes the model, and by postulating a prior
probability distribution $p_\Theta(\theta)$. Given the observed data $x$, one can, in principle,
use Bayes’ rule to derive a posterior probability distribution $p_{\Theta|X}(\theta \mid x)$. This
captures all the information that $x$ can provide about $\theta$.
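For a discrete unknown, the passage from prior to posterior is a short calculation. The sketch below is a hypothetical illustration, not an example from the text: the unknown bias of a coin takes one of three values with equal prior probability, the data are the number of heads in four tosses, and the function names `posterior` and `binomial_likelihood` are invented for this purpose.

```python
from fractions import Fraction
from math import comb


def posterior(prior, likelihood, x):
    """Bayes' rule for a discrete unknown: p(theta | x) is proportional to
    p(theta) * p(x | theta), normalized over all candidate values."""
    joint = {theta: p * likelihood(x, theta) for theta, p in prior.items()}
    evidence = sum(joint.values())
    return {theta: value / evidence for theta, value in joint.items()}


def binomial_likelihood(x, theta, n=4):
    """Probability of x heads in n independent tosses with bias theta."""
    return comb(n, x) * theta**x * (1 - theta) ** (n - x)


# Assumed prior: three candidate biases, equally likely.
prior = {Fraction(1, 4): Fraction(1, 3),
         Fraction(1, 2): Fraction(1, 3),
         Fraction(3, 4): Fraction(1, 3)}

post = posterior(prior, binomial_likelihood, x=3)
```

After observing three heads in four tosses, the posterior probabilities of the biases 1/4, 1/2, and 3/4 are 3/46, 8/23, and 27/46, respectively.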

By contrast, the classical approach views the unknown quantity $\theta$ as a
constant that happens to be unknown. It then strives to develop an estimate of $\theta$
that has some performance guarantees. This introduces an important conceptual
difference from other methods in this book: we are not dealing with a single
probabilistic model, but rather with multiple candidate probabilistic models,
one for each possible value of $\theta$.

The debate between the two schools has been ongoing for about a century, 
often with philosophical overtones. Furthermore, each school has constructed 
examples to show that the methods of the competing school can sometimes 
produce unreasonable or unappealing answers. We briefly review some of the 
arguments in this debate. 

Suppose that we are trying to measure a physical constant, say the mass of 
the electron, by means of noisy experiments. The classical statistician will argue 
that the mass of the electron, while unknown, is just a constant, and that there is 
no justification for modeling it as a random variable. The Bayesian statistician 
will counter that a prior distribution simply reflects our state of knowledge. 
For example, if we already know from past experiments a rough range for this 
quantity, we can express this knowledge by postulating a prior distribution which 
is concentrated over that range. 

A classical statistician will often object to the arbitrariness of picking a par- 
ticular prior. A Bayesian statistician will counter that every statistical procedure 
contains some hidden choices. Furthermore, in some cases, classical methods 
turn out to be equivalent to Bayesian ones, for a particular choice of a prior. By 
locating all of the assumptions in one place, in the form of a prior, the Bayesian 
statistician contends that these assumptions are brought to the surface and are 
amenable to scrutiny. 

Finally, there are practical considerations. In many cases, Bayesian meth- 
ods are computationally intractable, e.g., when they require the evaluation of 
multidimensional integrals. On the other hand, with the availability of faster 
computation, much of the recent research in the Bayesian community focuses on 
making Bayesian methods practical. 


Model versus Variable Inference 


Applications of statistical inference tend to be of two different types. In model 
inference, the object of study is a real phenomenon or process for which we 
wish to construct or validate a model on the basis of available data (e.g., do
planets follow elliptical trajectories?). Such a model can then be used to make 
predictions about the future, or to infer some hidden underlying causes. In 
variable inference, we wish to estimate the value of one or more unknown 
variables by using some related, possibly noisy information (e.g., what is my 
current position, given a few GPS readings?).

The distinction between model and variable inference is not sharp; for 
example, by describing a model in terms of a set of variables, we can cast a 
model inference problem as a variable inference problem. In any case, we will 
not emphasize this distinction in the sequel, because the same methodological 
principles apply to both types of inference. 

In some applications, both model and variable inference issues may arise. 
For example, we may collect some initial data, use them to build a model, and 
then use the model to make inferences about the values of certain variables. 


Example 8.1. A Noisy Channel. A transmitter sends a sequence of binary
messages $s_i \in \{0, 1\}$, and a receiver observes

$$X_i = a s_i + W_i, \qquad i = 1, \ldots, n,$$

where the $W_i$ are zero mean normal random variables that model channel noise,
and $a$ is a scalar that represents the channel attenuation. In a model inference
setting, $a$ is unknown. The transmitter sends a pilot signal consisting of a sequence
of messages $s_1, \ldots, s_n$, whose values are known by the receiver. On the basis of
the observations $X_1, \ldots, X_n$, the receiver wishes to estimate the value of $a$, that is,
build a model of the channel.

Alternatively, in a variable inference setting, $a$ is assumed to be known (possibly
because it has already been inferred using a pilot signal, as above). The receiver
observes $X_1, \ldots, X_n$, and wishes to infer the values of $s_1, \ldots, s_n$.
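The two settings of this example can be mimicked in a short simulation. Everything specific in the sketch below is an assumption made for illustration: the true attenuation `a_true = 2.0`, the noise standard deviation `sigma = 0.5`, the alternating pilot sequence, the least squares estimate of $a$, and the decoding rule that thresholds each observation at half the estimated attenuation. The text itself does not prescribe these particular estimators.

```python
import random

rng = random.Random(1)
a_true, sigma = 2.0, 0.5  # hypothetical attenuation and noise level


def transmit(bits):
    """Return observations X_i = a * s_i + W_i with zero mean normal noise."""
    return [a_true * s + rng.gauss(0.0, sigma) for s in bits]


# Model inference: estimate a from a pilot sequence known to the receiver.
pilot = [i % 2 for i in range(200)]
x_pilot = transmit(pilot)
# Least squares estimate of a (an assumed choice of estimator).
a_hat = sum(x * s for x, s in zip(x_pilot, pilot)) / sum(s * s for s in pilot)

# Variable inference: treat a_hat as known and infer the unknown messages.
message = [rng.randint(0, 1) for _ in range(1000)]
x_message = transmit(message)
decoded = [1 if x > a_hat / 2 else 0 for x in x_message]
error_rate = sum(d != s for d, s in zip(decoded, message)) / len(message)
```

With these assumed values the estimate `a_hat` should land near 2, and only a small fraction of the messages should be decoded incorrectly.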


A Rough Classification of Statistical Inference Problems 


We describe here a few different types of inference problems. In an estimation
problem, a model is fully specified, except for an unknown, possibly
multidimensional, parameter $\theta$, which we wish to estimate. This parameter can be viewed
as either a random variable (Bayesian approach) or as an unknown constant
(classical approach). The usual objective is to arrive at an estimate of $\theta$ that is
close to the true value in some sense. For example:


(a) In the noisy transmission problem of Example 8.1, use the knowledge of 
the pilot sequence and the observations to estimate a. 


(b) Using polling data, estimate the fraction of a voter population that prefers 
candidate A over candidate B. 


(c) On the basis of historical stock market data, estimate the mean and vari- 
ance of the daily movement in the price of a particular stock. 
In a binary hypothesis testing problem, we start with two hypotheses
and use the available data to decide which of the two is true. For example: 


(a) In the noisy transmission problem of Example 8.1, use the knowledge of $a$
and $X_i$ to decide whether $s_i$ was 0 or 1.


(b) Given a noisy picture, decide whether there is a person in the picture or 
not. 


(c) Given a set of trials with two alternative medical treatments, decide which 
treatment is most effective. 


More generally, in an $m$-ary hypothesis testing problem, there is a finite
number m of competing hypotheses. The performance of a particular method is 
typically judged by the probability that it makes an erroneous decision. Again, 
both Bayesian and classical approaches are possible. 

In this chapter, we focus primarily on problems of Bayesian estimation, but 
also discuss hypothesis testing. In the next chapter, in addition to estimation, 
we discuss a broader range of hypothesis testing problems. Our treatment is 
introductory and far from exhausts the range of statistical inference problems 
encountered in practice. As an illustration of a different type of problem, consider 
the construction of a model of the form $Y = g(X) + W$ that relates two random
variables X and Y. Here W is zero mean noise, and g is an unknown function to 
be estimated. Problems of this type, where the uncertain object (the function 
g in this case) cannot be described by a fixed number of parameters, are called 
nonparametric and are beyond our scope. 





Major Terms, Problems, and Methods in this Chapter 


• Bayesian statistics treats unknown parameters as random variables
with known prior distributions.


• In parameter estimation, we want to generate estimates that are
close to the true values of the parameters in some probabilistic sense.


• In hypothesis testing, the unknown parameter takes one of a finite
number of values, corresponding to competing hypotheses; we want to
choose one of the hypotheses, aiming to achieve a small probability of
error.


• Principal Bayesian inference methods:


(a) Maximum a posteriori probability (MAP) rule: Out of the
possible parameter values/hypotheses, select one with maximum
conditional/posterior probability given the data (Section 8.2).


(b) Least mean squares (LMS) estimation: Select an estimator/function
of the data that minimizes the mean squared error between
the parameter and its estimate (Section 8.3).


(c) Linear least mean squares estimation: Select an estimator
which is a linear function of the data and minimizes the mean
squared error between the parameter and its estimate (Section
8.4). This may result in higher mean squared error, but requires
simple calculations, based only on the means, variances, and
covariances of the random variables involved.








8.1 BAYESIAN INFERENCE AND THE POSTERIOR DISTRIBUTION 


In Bayesian inference, the unknown quantity of interest, which we denote by Θ,
is modeled as a random variable or as a finite collection of random variables.
Here, Θ may represent physical quantities, such as the position and velocity of a
vehicle, or a set of unknown parameters of a probabilistic model. For simplicity,
unless the contrary is explicitly stated, we view Θ as a single random variable.

We aim to extract information about Θ, based on observing a collection
X = (X_1, ..., X_n) of related random variables, called observations,
measurements, or an observation vector. For this, we assume that we know the joint
distribution of Θ and X. Equivalently, we assume that we know:


(a) A prior distribution p_Θ or f_Θ, depending on whether Θ is discrete or
continuous.


(b) A conditional distribution p_{X|Θ} or f_{X|Θ}, depending on whether X is
discrete or continuous.


Once a particular value x of X has been observed, a complete answer to the
Bayesian inference problem is provided by the posterior distribution p_{Θ|X}(θ | x) or
f_{Θ|X}(θ | x) of Θ; see Fig. 8.1. This distribution is determined by the appropriate
form of Bayes’ rule. It encapsulates everything there is to know about Θ, given
the available information, and it is the starting point for further analysis.
Figure 8.1. A summary of a Bayesian inference model. The starting point is
the joint distribution of the parameter Θ and observation X, or equivalently the
prior and conditional PMF/PDF. Given the value x of the observation X, the
posterior PMF/PDF is formed using Bayes’ rule. The posterior can be used to
answer additional inference questions; for example, the calculation of estimates of
Θ, and associated probabilities or error variances.
Summary of Bayesian Inference


• We start with a prior distribution p_Θ or f_Θ for the unknown random
variable Θ.


• We have a model p_{X|Θ} or f_{X|Θ} of the observation vector X.


• After observing the value x of X, we form the posterior distribution
of Θ, using the appropriate version of Bayes’ rule.





Note that there are four different versions of Bayes’ rule, which we repeat
here for easy reference. They correspond to the different combinations of discrete
or continuous Θ and X. Yet, all four versions are syntactically similar: starting
from the simplest version (all variables are discrete), we only need to replace a
PMF with a PDF and a sum with an integral when a continuous random variable
is involved. Furthermore, when Θ is multidimensional, the corresponding sums
or integrals are to be understood as multidimensional.







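The Four Versions of Bayes’ Rule


• Θ discrete, X discrete:

\[
p_{\Theta|X}(\theta \mid x) = \frac{p_\Theta(\theta)\, p_{X|\Theta}(x \mid \theta)}{\sum_{\theta'} p_\Theta(\theta')\, p_{X|\Theta}(x \mid \theta')}.
\]


• Θ discrete, X continuous:

\[
p_{\Theta|X}(\theta \mid x) = \frac{p_\Theta(\theta)\, f_{X|\Theta}(x \mid \theta)}{\sum_{\theta'} p_\Theta(\theta')\, f_{X|\Theta}(x \mid \theta')}.
\]


• Θ continuous, X discrete:

\[
f_{\Theta|X}(\theta \mid x) = \frac{f_\Theta(\theta)\, p_{X|\Theta}(x \mid \theta)}{\int f_\Theta(\theta')\, p_{X|\Theta}(x \mid \theta')\, d\theta'}.
\]


• Θ continuous, X continuous:

\[
f_{\Theta|X}(\theta \mid x) = \frac{f_\Theta(\theta)\, f_{X|\Theta}(x \mid \theta)}{\int f_\Theta(\theta')\, f_{X|\Theta}(x \mid \theta')\, d\theta'}.
\]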
Let us illustrate the calculation of the posterior with some examples.


Example 8.2. Romeo and Juliet start dating, but Juliet will be late on any
date by a random amount X, uniformly distributed over the interval [0, θ]. The
parameter θ is unknown and is modeled as the value of a random variable Θ,
uniformly distributed between zero and one hour. Assuming that Juliet was late by
an amount x on their first date, how should Romeo use this information to update
the distribution of Θ?

Here the prior PDF is

\[
f_\Theta(\theta) = \begin{cases} 1, & \text{if } 0 \le \theta \le 1, \\ 0, & \text{otherwise}, \end{cases}
\]

and the conditional PDF of the observation is

\[
f_{X|\Theta}(x \mid \theta) = \begin{cases} 1/\theta, & \text{if } 0 \le x \le \theta, \\ 0, & \text{otherwise}. \end{cases}
\]

Using Bayes’ rule, and taking into account that f_Θ(θ) f_{X|Θ}(x | θ) is nonzero only if
0 ≤ x ≤ θ ≤ 1, we find that for any x ∈ [0, 1], the posterior PDF is

\[
f_{\Theta|X}(\theta \mid x)
= \frac{f_\Theta(\theta)\, f_{X|\Theta}(x \mid \theta)}{\int_0^1 f_\Theta(\theta')\, f_{X|\Theta}(x \mid \theta')\, d\theta'}
= \frac{1/\theta}{\int_x^1 \frac{1}{\theta'}\, d\theta'}
= \frac{1}{\theta \cdot |\log x|}, \qquad \text{if } x \le \theta \le 1,
\]

and f_{Θ|X}(θ | x) = 0 if θ < x or θ > 1.

Consider now a variation involving the first n dates. Assume that Juliet is
late by random amounts X_1, ..., X_n, which given Θ = θ, are uniformly distributed
in the interval [0, θ], and conditionally independent. Let X = (X_1, ..., X_n) and
x = (x_1, ..., x_n). Similar to the case where n = 1, we have

\[
f_{X|\Theta}(x \mid \theta) = \begin{cases} 1/\theta^n, & \text{if } \bar{x} \le \theta \le 1, \\ 0, & \text{otherwise}, \end{cases}
\]

where

\[
\bar{x} = \max\{x_1, \ldots, x_n\}.
\]

The posterior PDF is

\[
f_{\Theta|X}(\theta \mid x) = \begin{cases} c(\bar{x})/\theta^n, & \text{if } \bar{x} \le \theta \le 1, \\ 0, & \text{otherwise}, \end{cases}
\]

where c(x̄) is a normalizing constant that depends only on x̄:

\[
c(\bar{x}) = \frac{1}{\int_{\bar{x}}^1 \frac{1}{(\theta')^n}\, d\theta'}.
\]
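
The closed-form posterior of Example 8.2 can be checked numerically. The following is a minimal sketch, not part of the original example: the function names, the midpoint-rule integrator, and the test value x = 0.3 are all assumptions made for illustration, and the formulas require 0 < x < 1.

```python
import math

def posterior_single(theta, x):
    """Posterior PDF of Theta after one observed lateness x, with 0 < x < 1."""
    if x <= theta <= 1.0:
        return 1.0 / (theta * abs(math.log(x)))
    return 0.0

def integrate_posterior(x, steps=20_000):
    """Midpoint-rule integral of the posterior over [x, 1]; should be close to 1."""
    h = (1.0 - x) / steps
    return sum(posterior_single(x + (j + 0.5) * h, x) for j in range(steps)) * h

def normalizing_constant(x_bar, n, steps=20_000):
    """c(x_bar) = 1 / integral over [x_bar, 1] of theta^(-n), by the midpoint rule."""
    h = (1.0 - x_bar) / steps
    integral = sum((x_bar + (j + 0.5) * h) ** (-n) for j in range(steps)) * h
    return 1.0 / integral

if __name__ == "__main__":
    print(integrate_posterior(0.3))      # total posterior mass
    print(normalizing_constant(0.3, 3))  # constant for n = 3 dates, largest lateness 0.3
```

For n = 1 the constant reduces to 1/|log x̄|, which matches the single-date posterior above.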




Example 8.3. Inference of a Common Mean of Normal Random Variables.
We observe a collection X = (X_1, ..., X_n) of random variables, with an
unknown common mean whose value we wish to infer. We assume that given the
value of the common mean, the X_i are normal and independent, with known
variances σ_1^2, ..., σ_n^2. In a Bayesian approach to this problem, we model the common
mean as a random variable Θ, with a given prior. For concreteness, we assume a
normal prior, with known mean x_0 and known variance σ_0^2.

Let us note, for future reference, that our model is equivalent to one of the
form

\[
X_i = \Theta + W_i, \qquad i = 1, \ldots, n,
\]

where the random variables Θ, W_1, ..., W_n are independent and normal, with known
means and variances. In particular, for any value θ,

\[
E[W_i] = E[W_i \mid \Theta = \theta] = 0, \qquad \operatorname{var}(W_i) = \operatorname{var}(X_i \mid \Theta = \theta) = \sigma_i^2.
\]

A model of this type is common in many engineering applications involving several
independent measurements of an unknown quantity.

According to our assumptions, we have

\[
f_\Theta(\theta) = c_1 \exp\left\{ -\frac{(\theta - x_0)^2}{2\sigma_0^2} \right\},
\]

and

\[
f_{X|\Theta}(x \mid \theta) = c_2 \exp\left\{ -\frac{(x_1 - \theta)^2}{2\sigma_1^2} \right\} \cdots \exp\left\{ -\frac{(x_n - \theta)^2}{2\sigma_n^2} \right\},
\]

where c_1 and c_2 are normalizing constants that do not depend on θ. We invoke
Bayes’ rule,

\[
f_{\Theta|X}(\theta \mid x) = \frac{f_\Theta(\theta)\, f_{X|\Theta}(x \mid \theta)}{\int f_\Theta(\theta')\, f_{X|\Theta}(x \mid \theta')\, d\theta'},
\]

and note that the numerator term, f_Θ(θ) f_{X|Θ}(x | θ), is of the form

\[
c_1 c_2 \exp\left\{ -\sum_{i=0}^{n} \frac{(x_i - \theta)^2}{2\sigma_i^2} \right\}.
\]

After some algebra, which involves completing the square inside the exponent, we
find that the numerator is of the form

\[
d \cdot \exp\left\{ -\frac{(\theta - m)^2}{2v} \right\},
\]

where

\[
m = \frac{\sum_{i=0}^{n} x_i/\sigma_i^2}{\sum_{i=0}^{n} 1/\sigma_i^2}, \qquad v = \frac{1}{\sum_{i=0}^{n} 1/\sigma_i^2},
\]
and d is a constant, which depends on the x_i but does not depend on θ. Since the
denominator term in Bayes’ rule does not depend on θ either, we conclude that the
posterior PDF has the form

\[
f_{\Theta|X}(\theta \mid x) = a \cdot \exp\left\{ -\frac{(\theta - m)^2}{2v} \right\},
\]

for some normalizing constant a, which depends on the x_i, but not on θ. We
recognize this as a normal PDF, and we conclude that the posterior PDF is normal
with mean m and variance v.

As a special case, suppose that σ_0^2, σ_1^2, ..., σ_n^2 have a common value σ^2. Then,
the posterior PDF of Θ is normal with mean and variance

\[
m = \frac{x_0 + \cdots + x_n}{n + 1}, \qquad v = \frac{\sigma^2}{n + 1},
\]

respectively. In this case, the prior mean x_0 acts just as another observation, and
contributes equally to determine the posterior mean m of Θ. Notice also that the
standard deviation of the posterior PDF of Θ tends to zero, at the rough rate of
1/√n, as the number of observations increases.

If the variances σ_i^2 are different, the formula for the posterior mean m is still
a weighted average of the x_i, but with a larger weight on the observations with
smaller variance.


The preceding example has the remarkable property that the posterior
distribution of Θ is in the same family as the prior distribution, namely, the
family of normal distributions. This is appealing for two reasons:


(a) The posterior can be characterized in terms of only two numbers, the mean
and the variance.


(b) The form of the solution opens up the possibility of efficient recursive
inference. Suppose that after X_1, ..., X_n are observed, an additional
observation X_{n+1} is obtained. Instead of solving the inference problem from
scratch, we can view f_{Θ|X_1,...,X_n} as our prior, and use the new observation
to obtain the new posterior f_{Θ|X_1,...,X_n,X_{n+1}}. We may then apply the
solution of Example 8.3 to this new problem. It is then plausible (and can
be formally established) that the new posterior of Θ will have mean

\[
\frac{(m/v) + (x_{n+1}/\sigma_{n+1}^2)}{(1/v) + (1/\sigma_{n+1}^2)}
\]

and variance

\[
\frac{1}{(1/v) + (1/\sigma_{n+1}^2)},
\]

where m and v are the mean and variance of the old posterior f_{Θ|X_1,...,X_n}.
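
The recursive update in item (b) can be compared against the batch formulas for m and v from Example 8.3. The sketch below is illustrative only: the function names and the numerical values of the prior, observations, and variances are assumptions, not data from the text.

```python
def normal_posterior(x0, var0, xs, variances):
    """Batch posterior mean and variance; the prior mean x0 acts as observation 0."""
    precisions = [1.0 / var0] + [1.0 / s for s in variances]
    values = [x0] + list(xs)
    total = sum(precisions)
    m = sum(p * x for p, x in zip(precisions, values)) / total
    return m, 1.0 / total

def recursive_update(m, v, x_new, var_new):
    """Fold one new observation into a normal posterior with mean m and variance v."""
    precision = 1.0 / v + 1.0 / var_new
    return (m / v + x_new / var_new) / precision, 1.0 / precision

# Hypothetical data: prior N(0, 4) and three measurements with unequal variances.
x0, var0 = 0.0, 4.0
xs = [1.2, 0.7, 1.9]
variances = [1.0, 0.5, 2.0]

batch_m, batch_v = normal_posterior(x0, var0, xs, variances)

rec_m, rec_v = x0, var0
for x, s in zip(xs, variances):
    rec_m, rec_v = recursive_update(rec_m, rec_v, x, s)

if __name__ == "__main__":
    print(batch_m, batch_v)
    print(rec_m, rec_v)
```

Processing the observations one at a time gives the same mean and variance as the batch computation, up to floating-point rounding.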


This situation where the posterior is in the same family of distributions as 
the prior is not very common. Besides the normal family, another prominent 
example involves coin tosses/Bernoulli trials and binomial distributions. 
Example 8.4. Beta Priors on the Bias of a Coin. We wish to estimate the
probability of heads, denoted by θ, of a biased coin. We model θ as the value of a
random variable Θ with known prior PDF f_Θ. We consider n independent tosses
and let X be the number of heads observed. From Bayes’ rule, the posterior PDF
of Θ has the form, for θ ∈ [0, 1],

\[
f_{\Theta|X}(\theta \mid k) = c\, f_\Theta(\theta)\, p_{X|\Theta}(k \mid \theta)
= d\, f_\Theta(\theta)\, \theta^k (1 - \theta)^{n-k},
\]

where c is a normalizing constant (independent of θ), and d = c \binom{n}{k}.

Suppose now that the prior is a beta density with integer parameters α > 0
and β > 0, of the form

\[
f_\Theta(\theta) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\, \theta^{\alpha-1} (1 - \theta)^{\beta-1}, & \text{if } 0 < \theta < 1, \\ 0, & \text{otherwise}, \end{cases}
\]

where B(α, β) is a normalizing constant, known as the Beta function, given by

\[
B(\alpha, \beta) = \int_0^1 \theta^{\alpha-1} (1 - \theta)^{\beta-1}\, d\theta = \frac{(\alpha - 1)!\, (\beta - 1)!}{(\alpha + \beta - 1)!};
\]

the last equality can be obtained from integration by parts, or through a
probabilistic argument (see Problem 30 in Chapter 3). Then, the posterior PDF of Θ is
of the form

\[
f_{\Theta|X}(\theta \mid k) = \frac{d}{B(\alpha, \beta)}\, \theta^{k+\alpha-1} (1 - \theta)^{n-k+\beta-1}, \qquad 0 \le \theta \le 1,
\]

and hence is a beta density with parameters

\[
\alpha' = k + \alpha, \qquad \beta' = n - k + \beta.
\]

In the special case where α = β = 1, the prior f_Θ is the uniform density over [0, 1].
In this case, the posterior density is beta with parameters k + 1 and n - k + 1.

The beta density arises often in inference applications and has interesting
properties. In particular, if Θ has a beta density with integer parameters α > 0
and β > 0, its mth moment is given by

\[
E[\Theta^m] = \frac{1}{B(\alpha, \beta)} \int_0^1 \theta^{m+\alpha-1} (1 - \theta)^{\beta-1}\, d\theta
= \frac{B(\alpha + m, \beta)}{B(\alpha, \beta)}
= \frac{\alpha(\alpha + 1) \cdots (\alpha + m - 1)}{(\alpha + \beta)(\alpha + \beta + 1) \cdots (\alpha + \beta + m - 1)}.
\]
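
The beta update and the moment formula of Example 8.4 are easy to verify with exact rational arithmetic. This is a small sketch under the example's assumption of positive integer parameters; the function names are chosen here for illustration.

```python
from fractions import Fraction
from math import factorial

def beta_function(a, b):
    """B(a, b) for positive integers a, b, via the factorial formula."""
    return Fraction(factorial(a - 1) * factorial(b - 1), factorial(a + b - 1))

def beta_posterior(a, b, n, k):
    """Posterior beta parameters after observing k heads in n tosses."""
    return a + k, b + n - k

def beta_moment(a, b, m):
    """m-th moment of a beta density, using the product formula."""
    result = Fraction(1)
    for j in range(m):
        result *= Fraction(a + j, a + b + j)
    return result

if __name__ == "__main__":
    a_post, b_post = beta_posterior(1, 1, 10, 7)  # uniform prior, 7 heads in 10 tosses
    print(a_post, b_post)
    print(beta_moment(a_post, b_post, 1))         # posterior mean
```

With a uniform prior and 7 heads in 10 tosses, the posterior parameters are 8 and 4, and the first moment agrees with the ratio B(α + 1, β)/B(α, β).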


The preceding examples involved a continuous random variable Θ, and were
typical of parameter estimation problems. The following is a discrete example,
and is typical of those arising in binary hypothesis testing problems.
Example 8.5. Spam Filtering. An email message may be “spam” or
“legitimate.” We introduce a parameter Θ, taking values 1 and 2, corresponding to
spam and legitimate, respectively, with given probabilities p_Θ(1) and p_Θ(2). Let
{w_1, ..., w_n} be a collection of special words (or combinations of words) whose
appearance suggests a spam message. For each i, let X_i be the Bernoulli random
variable that models the appearance of w_i in the message (X_i = 1 if w_i appears and
X_i = 0 if it does not). We assume that the conditional probabilities p_{X_i|Θ}(x_i | 1) and
p_{X_i|Θ}(x_i | 2), x_i = 0, 1, are known. For simplicity we also assume that conditioned
on Θ, the random variables X_1, ..., X_n are independent.

We calculate the posterior probabilities of spam and legitimate messages using
Bayes’ rule. We have

\[
P(\Theta = m \mid X_1 = x_1, \ldots, X_n = x_n)
= \frac{p_\Theta(m) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid m)}{\sum_{j=1}^{2} p_\Theta(j) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid j)}, \qquad m = 1, 2.
\]

These posterior probabilities may be used to classify the message as spam or
legitimate, by using methods to be discussed later.
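
As a rough illustration of the posterior formula in Example 8.5, the following sketch evaluates it for three special words. The prior and the per-word appearance probabilities are made-up numbers, and `spam_posterior` is a name introduced here, not something defined in the text.

```python
def spam_posterior(prior, word_probs, x):
    """Posterior over hypotheses given the 0/1 word-appearance vector x.

    prior:      {hypothesis: prior probability}
    word_probs: {hypothesis: [P(X_i = 1 | hypothesis) for each word i]}
    """
    numerators = {}
    for m, p_m in prior.items():
        likelihood = 1.0
        for q, xi in zip(word_probs[m], x):
            likelihood *= q if xi == 1 else 1.0 - q  # conditional independence
        numerators[m] = p_m * likelihood
    total = sum(numerators.values())
    return {m: value / total for m, value in numerators.items()}

# Hypothetical model: 1 = spam, 2 = legitimate.
prior = {1: 0.4, 2: 0.6}
word_probs = {1: [0.8, 0.6, 0.5], 2: [0.1, 0.2, 0.4]}
posterior = spam_posterior(prior, word_probs, [1, 0, 1])

if __name__ == "__main__":
    print(posterior)
```

Here the first and third words appear and the second does not; under these assumed numbers the posterior puts most of its mass on the spam hypothesis.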


Multiparameter Problems 


Our discussion has so far focused on the case of a single unknown parameter. 
The case of multiple unknown parameters is entirely similar. Our next example 
involves a two-dimensional parameter. 


Example 8.6. Localization Using a Sensor Network. There are n acoustic
sensors, spread out over a geographic region of interest. The coordinates of
the ith sensor are (a_i, b_i). A vehicle with known acoustic signature is located in
this region, at unknown Cartesian coordinates Θ = (Θ_1, Θ_2). Every sensor has a
distance-dependent probability of detecting the vehicle (i.e., “picking up” the
vehicle’s signature). Based on which sensors detected the vehicle, and which ones did
not, we would like to infer as much as possible about its location.

The prior PDF f_Θ is meant to capture our beliefs on the location of the
vehicle, possibly based on previous observations. Let us assume, for simplicity,
that Θ_1 and Θ_2 are independent normal random variables with zero mean and unit
variance, so that

\[
f_\Theta(\theta_1, \theta_2) = \frac{1}{2\pi}\, e^{-(\theta_1^2 + \theta_2^2)/2}.
\]

Let X_i be equal to 1 (respectively, 0) if sensor i has detected (respectively,
has not detected) the vehicle. To model signal attenuation, we assume that the
probability of detection decreases exponentially with the distance between the
vehicle and the sensor, which we denote by d_i(θ_1, θ_2). More precisely, we use
the model

\[
P\big(X_i = 1 \mid \Theta = (\theta_1, \theta_2)\big) = p_{X_i|\Theta}(1 \mid \theta_1, \theta_2) = e^{-d_i(\theta_1, \theta_2)},
\]

where d_i^2(θ_1, θ_2) = (a_i - θ_1)^2 + (b_i - θ_2)^2. Furthermore, we assume that
conditioned on the vehicle’s location θ, the X_i are independent.

To calculate the posterior PDF, let S be the set of sensors for which X_i = 1.
The numerator term in Bayes’ formula for f_{Θ|X}(θ | x) is

\[
f_\Theta(\theta)\, p_{X|\Theta}(x \mid \theta)
= \frac{1}{2\pi}\, e^{-(\theta_1^2 + \theta_2^2)/2} \prod_{i \in S} e^{-d_i(\theta_1, \theta_2)} \prod_{i \notin S} \left( 1 - e^{-d_i(\theta_1, \theta_2)} \right),
\]

where x is the vector with components x_i = 1 for i ∈ S, and x_i = 0 for i ∉ S. The
denominator is obtained by integrating the numerator over θ_1 and θ_2.

Figure 8.2. Localization using a sensor network.


As Example 8.6 illustrates, the principles for calculating the posterior PDF
f_{Θ|X}(θ | x) are essentially the same, regardless of whether Θ consists of one or
of multiple components. However, while the posterior PDF can be obtained
in principle using Bayes’ rule, a closed form solution should not be expected
in general. In practice, some numerical computation may be required. Often,
calculating the normalizing constant in the denominator of Bayes’ formula can
be challenging. In Example 8.6, the denominator is a double integral, over the
variables θ_1 and θ_2, and its numerical evaluation is manageable. If, however,
Θ is high-dimensional, numerical integration becomes formidable. There exist
sophisticated methods that can often carry out this integration approximately,
based on random sampling, but they are beyond our scope.
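
To make the remark about the double integral concrete, here is one possible way to approximate the posterior of Example 8.6 on a grid. It is only a sketch: the sensor coordinates, the detection pattern, the grid half-width, and the resolution are hypothetical choices, and a plain midpoint rule stands in for whatever numerical method one would use in practice.

```python
import math

def unnormalized_posterior(t1, t2, sensors, detections):
    """Numerator of Bayes' formula: standard normal prior times detection likelihoods."""
    value = math.exp(-(t1 * t1 + t2 * t2) / 2.0) / (2.0 * math.pi)
    for (a, b), x in zip(sensors, detections):
        p_detect = math.exp(-math.hypot(a - t1, b - t2))  # e^{-d_i}, d_i = distance
        value *= p_detect if x == 1 else 1.0 - p_detect
    return value

def grid_posterior(sensors, detections, half_width=5.0, steps=100):
    """Posterior density on a square grid, normalized by a midpoint-rule double integral."""
    h = 2.0 * half_width / steps
    centers = [-half_width + (j + 0.5) * h for j in range(steps)]
    weights = [[unnormalized_posterior(t1, t2, sensors, detections) for t2 in centers]
               for t1 in centers]
    denominator = sum(sum(row) for row in weights) * h * h
    density = [[w / denominator for w in row] for row in weights]
    return centers, density, h

# Hypothetical layout: three sensors, only the first one detects the vehicle.
sensors = [(1.0, 0.0), (-1.0, 0.5), (0.0, -1.5)]
detections = [1, 0, 0]
centers, density, h = grid_posterior(sensors, detections)

# Posterior mean of the first coordinate, from the same grid.
mean_theta1 = sum(centers[i] * sum(density[i]) for i in range(len(centers))) * h * h

if __name__ == "__main__":
    print(mean_theta1)
```

The cost of such a grid grows exponentially with the number of components of Θ, which is the difficulty the paragraph above points to.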

When Θ = (Θ_1, ..., Θ_m) is multidimensional, we are sometimes interested
in just a single component of Θ, say Θ_1. We may then focus on f_{Θ_1|X}(θ_1 | x),
the marginal posterior PDF of Θ_1, which can be obtained from the formula

\[
f_{\Theta_1|X}(\theta_1 \mid x) = \int \cdots \int f_{\Theta|X}(\theta_1, \theta_2, \ldots, \theta_m \mid x)\, d\theta_2 \cdots d\theta_m.
\]

Note, however, that when Θ is high-dimensional, evaluating this multiple integral
can be difficult.


8.2 POINT ESTIMATION, HYPOTHESIS TESTING, AND THE MAP RULE


We will now introduce a simple and general Bayesian inference method, and
discuss its application to estimation and hypothesis testing problems. Given the
value x of the observation, we select a value of θ, denoted θ̂, that maximizes the
posterior distribution p_{Θ|X}(θ | x) [or f_{Θ|X}(θ | x), if Θ is continuous]:

\[
\hat{\theta} = \arg\max_{\theta}\, p_{\Theta|X}(\theta \mid x), \qquad (\Theta \text{ discrete}),
\]

\[
\hat{\theta} = \arg\max_{\theta}\, f_{\Theta|X}(\theta \mid x), \qquad (\Theta \text{ continuous}).
\]

This is called the Maximum a Posteriori probability (MAP) rule (see
Fig. 8.3).


[Figure 8.3: left panel, posterior f_{Θ|X}(θ | x); right panel, posterior p_{Θ|X}(θ | x).]

Figure 8.3. Illustration of the MAP rule for inference of a continuous parameter
(left figure) and a discrete parameter (right figure).


When Θ is discrete, the MAP rule has an important optimality property:
since it chooses θ̂ to be the most likely value of Θ, it maximizes the probability of
correct decision for any given value x. This implies that it also maximizes (over
all decision rules) the overall (averaged over all possible values x) probability
of correct decision. Equivalently, the MAP rule minimizes the probability of an
incorrect decision [for each observation value x, as well as the overall probability
of error (averaged over x)].†


† To state this more precisely, let us consider a general decision rule, which upon
observing the value x, selects a value of θ denoted by g(x). Denote also by g_MAP(·) the
MAP rule. Let I and I_MAP be Bernoulli random variables that are equal to 1 whenever
the general decision rule (respectively, the MAP rule) makes a correct decision; thus,
the event {I = 1} is the same as the event {g(X) = Θ}, and similarly for g_MAP. By
the definition of the MAP rule,

\[
E[I \mid X] = P\big(g(X) = \Theta \mid X\big) \le P\big(g_{\text{MAP}}(X) = \Theta \mid X\big) = E[I_{\text{MAP}} \mid X],
\]

for any possible realization of X. Using the law of iterated expectations, we obtain
E[I] ≤ E[I_MAP], or

\[
P\big(g(X) = \Theta\big) \le P\big(g_{\text{MAP}}(X) = \Theta\big).
\]

Thus, the MAP rule maximizes the overall probability of a correct decision over all
decision rules g. Note that this argument is mostly relevant when Θ is discrete. If
Θ, when conditioned on X = x, is a continuous random variable, the probability of a
correct decision is 0 under any rule.


The form of the posterior distribution, as given by Bayes’ rule, allows
an important computational shortcut: the denominator is the same for all θ
and depends only on the value x of the observation. Thus, to maximize the
posterior, we only need to choose a value of θ that maximizes the numerator
p_Θ(θ) p_{X|Θ}(x | θ) if Θ and X are discrete, or similar expressions if Θ and/or X
are continuous. Calculation of the denominator is unnecessary.


The Maximum a Posteriori Probability (MAP) Rule


• Given the observation value x, the MAP rule selects a value θ̂ that
maximizes over θ the posterior distribution p_{Θ|X}(θ | x) (if Θ is discrete)
or f_{Θ|X}(θ | x) (if Θ is continuous).


• Equivalently, it selects θ̂ that maximizes over θ:

p_Θ(θ) p_{X|Θ}(x | θ)   (if Θ and X are discrete),
p_Θ(θ) f_{X|Θ}(x | θ)   (if Θ is discrete and X is continuous),
f_Θ(θ) p_{X|Θ}(x | θ)   (if Θ is continuous and X is discrete),
f_Θ(θ) f_{X|Θ}(x | θ)   (if Θ and X are continuous).


• If Θ takes only a finite number of values, the MAP rule minimizes (over
all decision rules) the probability of selecting an incorrect hypothesis.
This is true for both the unconditional probability of error and the
conditional one, given any observation value x.


Let us illustrate the MAP rule by revisiting some of the examples in the
preceding section.


Example 8.3 (continued). Here, Θ is a normal random variable, with mean x_0
and variance σ_0^2. We observe a collection X = (X_1, ..., X_n) of random variables
which, conditioned on the value θ of Θ, are independent normal random variables
with mean θ and variances σ_1^2, ..., σ_n^2. We found that the posterior PDF is normal
with mean m and variance v, given by

\[
m = E[\Theta \mid X = x] = \frac{\sum_{i=0}^{n} x_i/\sigma_i^2}{\sum_{i=0}^{n} 1/\sigma_i^2}, \qquad
v = \operatorname{var}(\Theta \mid X = x) = \frac{1}{\sum_{i=0}^{n} 1/\sigma_i^2}.
\]

Since the normal PDF is maximized at its mean, the MAP estimate is θ̂ = m.


Example 8.5 (continued). Here the parameter Θ takes values 1 and 2,
corresponding to spam and legitimate messages, respectively, with given probabilities
p_Θ(1) and p_Θ(2), and X_i is the Bernoulli random variable that models the
appearance of w_i in the message (X_i = 1 if w_i appears and X_i = 0 if it does not). We
have calculated the posterior probabilities of spam and legitimate messages as

\[
P(\Theta = m \mid X_1 = x_1, \ldots, X_n = x_n)
= \frac{p_\Theta(m) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid m)}{\sum_{j=1}^{2} p_\Theta(j) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid j)}, \qquad m = 1, 2.
\]

Suppose we want to classify a message as spam or legitimate based on the
corresponding vector (x_1, ..., x_n). Then, the MAP rule decides that the message is
spam if

\[
P(\Theta = 1 \mid X_1 = x_1, \ldots, X_n = x_n) > P(\Theta = 2 \mid X_1 = x_1, \ldots, X_n = x_n),
\]

or equivalently, if

\[
p_\Theta(1) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid 1) > p_\Theta(2) \prod_{i=1}^{n} p_{X_i|\Theta}(x_i \mid 2).
\]
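
Because the MAP comparison only involves the numerators, it can be carried out without ever normalizing. One common implementation choice, sketched below, is to compare sums of logarithms instead of products, which avoids numerical underflow when n is large. The model numbers are hypothetical, the function name is introduced here, and the sketch assumes every probability lies strictly between 0 and 1 so that the logarithms are finite.

```python
import math

def map_classify(prior, word_probs, x):
    """Return the hypothesis with the largest log of prior times likelihood."""
    scores = {}
    for m, p_m in prior.items():
        score = math.log(p_m)
        for q, xi in zip(word_probs[m], x):
            score += math.log(q if xi == 1 else 1.0 - q)
        scores[m] = score
    return max(scores, key=scores.get)

# Hypothetical model: 1 = spam, 2 = legitimate.
prior = {1: 0.4, 2: 0.6}
word_probs = {1: [0.8, 0.6, 0.5], 2: [0.1, 0.2, 0.4]}

if __name__ == "__main__":
    print(map_classify(prior, word_probs, [1, 0, 1]))
    print(map_classify(prior, word_probs, [0, 0, 0]))
```

Since the logarithm is increasing, the hypothesis with the largest log score is the same one that maximizes the product in the displayed inequality.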


Point Estimation 


In an estimation problem, given the observed value x of X, the posterior
distribution captures all the relevant information provided by x. On the other
hand, we may be interested in certain quantities that summarize properties of
the posterior. For example, we may select a point estimate, which is a single
numerical value that represents our best guess of the value of Θ.

Let us introduce some concepts and terminology relating to estimation. For
simplicity, we assume that Θ is one-dimensional, but the methodology extends
to other cases. We use the term estimate to refer to the numerical value θ̂ that
we choose to report on the basis of the actual observation x. The value of θ̂ is
to be determined by applying some function g to the observation x, resulting
in θ̂ = g(x). The random variable Θ̂ = g(X) is called an estimator, and its
realized value equals g(x) whenever the random variable X takes the value x.
The reason that Θ̂ is a random variable is that the outcome of the estimation
procedure depends on the random value of the observation.

We can use different functions g to form different estimators; some will be
better than others. For an extreme example, consider the function that satisfies
g(x) = 0 for all x. The resulting estimator, Θ̂ = 0, makes no use of the data, and
is therefore not a good choice. We have already seen two of the most popular
estimators:


(a) The Maximum a Posteriori Probability (MAP) estimator. Here,
having observed x, we choose an estimate θ̂ that maximizes the posterior
distribution over all θ, breaking ties arbitrarily.


(b) The Conditional Expectation estimator, introduced in Section 4.3. Here,
we choose the estimate θ̂ = E[Θ | X = x].


The conditional expectation estimator will be discussed in detail in the 
next section. It will be called there the least mean squares (LMS) estimator 
because it has an important property: it minimizes the mean squared error over 
all estimators, as we show later. Regarding the MAP estimator, we have a few 
remarks. 


(a) If the posterior distribution of Θ is symmetric around its (conditional) mean
and unimodal (i.e., has a single maximum), the maximum occurs at the
mean. Then, the MAP estimator coincides with the conditional expectation
estimator. This is the case, for example, if the posterior distribution is
guaranteed to be normal, as in Example 8.3.


(b) If Θ is continuous, the actual evaluation of the MAP estimate θ̂ can
sometimes be carried out analytically; for example, if there are no constraints on
θ, by setting to zero the derivative of f_{Θ|X}(θ | x), or of log f_{Θ|X}(θ | x), and
solving for θ. In other cases, however, a numerical search may be required.


Point Estimates


• An estimator is a random variable of the form Θ̂ = g(X), for some
function g. Different choices of g correspond to different estimators.


• An estimate is the value θ̂ of an estimator, as determined by the
realized value x of the observation X.


• Once the value x of X is observed, the Maximum a Posteriori
Probability (MAP) estimator sets the estimate θ̂ to a value that
maximizes the posterior distribution over all possible values of θ.


• Once the value x of X is observed, the Conditional Expectation
(LMS) estimator sets the estimate θ̂ to E[Θ | X = x].
Example 8.7. Consider Example 8.2, in which Juliet is late on the first date by a
random amount X. The distribution of X is uniform over the interval [0, Θ], and
Θ is an unknown random variable with a uniform prior PDF f_Θ over the interval
[0, 1]. In that example, we saw that for x ∈ [0, 1], the posterior PDF is

\[
f_{\Theta|X}(\theta \mid x) = \begin{cases} \dfrac{1}{\theta \cdot |\log x|}, & \text{if } x \le \theta \le 1, \\ 0, & \text{otherwise}. \end{cases}
\]

For a given x, f_{Θ|X}(θ | x) is decreasing in θ, over the range [x, 1] of possible values
of Θ. Thus, the MAP estimate is equal to x. Note that this is an “optimistic”
estimate. If Juliet is late by a small amount on the first date (x ≈ 0), the estimate
of future lateness is also small.

The conditional expectation estimate turns out to be less “optimistic.” In
particular, we have

\[
E[\Theta \mid X = x] = \int_x^1 \theta\, \frac{1}{\theta \cdot |\log x|}\, d\theta = \frac{1 - x}{|\log x|}.
\]

The two estimates are plotted as functions of x in Fig. 8.4. It can be seen that for
any observed lateness x, E[Θ | X = x] is larger than the MAP estimate of Θ.

Figure 8.4. MAP and conditional expectation estimates as functions of the
observation x in Example 8.7.
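
The comparison shown in Fig. 8.4 can be reproduced numerically from the two closed-form estimates. The short sketch below uses function names and sample values of x chosen for illustration, and it assumes 0 < x < 1 so that log x is finite and nonzero.

```python
import math

def map_estimate(x):
    """MAP estimate in Example 8.7: the posterior is decreasing on [x, 1]."""
    return x

def conditional_expectation(x):
    """E[Theta | X = x] = (1 - x) / |log x|, valid for 0 < x < 1."""
    return (1.0 - x) / abs(math.log(x))

if __name__ == "__main__":
    for x in (0.1, 0.3, 0.5, 0.7, 0.9):
        print(x, map_estimate(x), conditional_expectation(x))
```

At each sample point the conditional expectation exceeds the MAP estimate, consistent with the inequality log(1/x) < 1/x - 1 for 0 < x < 1.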


Example 8.8. We consider the model in Example 8.4, where we observe the
number X of heads in n independent tosses of a biased coin. We assume that the
prior distribution of Θ, the probability of heads, is uniform over [0, 1]. We will
derive the MAP and conditional expectation estimators of Θ.

As shown in Example 8.4, when X = k, the posterior PDF of Θ is a beta
density with parameters α = k + 1 and β = n - k + 1:

\[
f_{\Theta|X}(\theta \mid k) = \begin{cases} \dfrac{1}{B(k + 1, n - k + 1)}\, \theta^k (1 - \theta)^{n-k}, & \text{if } \theta \in [0, 1], \\ 0, & \text{otherwise}. \end{cases}
\]

This posterior PDF turns out to have a single peak. To find its location, we
differentiate the expression θ^k (1 - θ)^{n-k} with respect to θ, and set the derivative to
zero, to obtain

\[
k\theta^{k-1}(1 - \theta)^{n-k} - (n - k)\theta^k (1 - \theta)^{n-k-1} = 0,
\]

which yields

\[
\hat{\theta} = \frac{k}{n}.
\]

This is the MAP estimate.

To obtain the conditional expectation estimate, we use the formula for the
mean of the beta density (cf. Example 8.4):

\[
E[\Theta \mid X = k] = \frac{k + 1}{n + 2}.
\]

Note that for large values of n, the MAP and conditional expectation estimates
nearly coincide.
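
A quick computation illustrates how close the two estimates of Example 8.8 become as n grows. The helper name and the sample values of n and k below are arbitrary choices for illustration.

```python
from fractions import Fraction

def coin_estimates(n, k):
    """MAP estimate k/n and conditional expectation (k+1)/(n+2), uniform prior."""
    return Fraction(k, n), Fraction(k + 1, n + 2)

if __name__ == "__main__":
    for n, k in [(10, 7), (100, 70), (1000, 700)]:
        map_est, lms_est = coin_estimates(n, k)
        print(n, float(map_est), float(lms_est), float(abs(map_est - lms_est)))
```

The gap between the two estimates equals |n - 2k| / (n(n + 2)), which is at most 1/(n + 2).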


In the absence of additional assumptions, a point estimate carries no
guarantees on its accuracy. For example, the MAP estimate may lie quite far from
the bulk of the posterior distribution. Thus, it is usually desirable to also
report some additional information, such as the conditional mean squared error
E[(Θ̂ - Θ)^2 | X = x]. In the next section, we will discuss this issue further.
In particular, we will revisit the two preceding examples and we will calculate
the conditional mean squared error for the MAP and conditional expectation
estimates.


Hypothesis Testing 


In a hypothesis testing problem, Θ takes one of m values, θ_1, ..., θ_m, where m
is usually a small integer; often m = 2, in which case we are dealing with a
binary hypothesis testing problem. We refer to the event {Θ = θ_i} as the ith
hypothesis, and denote it by H_i.

Once the value x of X is observed, we may use Bayes’ rule to calculate
the posterior probabilities P(Θ = θ_i | X = x) = p_{Θ|X}(θ_i | x), for each i. We may
then select the hypothesis whose posterior probability is largest, according to
the MAP rule. (If there is a tie, with several hypotheses attaining the largest
posterior probability, we can choose among them arbitrarily.) As mentioned
earlier, the MAP rule is optimal in the sense that it maximizes the probability
of correct decision over all decision rules.
The MAP Rule for Hypothesis Testing


• Given the observation value x, the MAP rule selects a hypothesis H_i
for which the value of the posterior probability P(Θ = θ_i | X = x) is
largest.


• Equivalently, it selects a hypothesis H_i for which p_Θ(θ_i) p_{X|Θ}(x | θ_i) (if
X is discrete) or p_Θ(θ_i) f_{X|Θ}(x | θ_i) (if X is continuous) is largest.


• The MAP rule minimizes the probability of selecting an incorrect
hypothesis for any observation value x, as well as the probability of error
over all decision rules.





Once we have derived the MAP rule, we may also compute the corresponding
probability of a correct decision (or error), as a function of the observation
value x. In particular, if g_MAP(x) is the hypothesis selected by the MAP rule
when X = x, the probability of correct decision is

\[
P\big(\Theta = g_{\text{MAP}}(x) \mid X = x\big).
\]

Furthermore, if S_i is the set of all x such that the MAP rule selects hypothesis
H_i, the overall probability of correct decision is

\[
P\big(\Theta = g_{\text{MAP}}(X)\big) = \sum_i P(\Theta = \theta_i,\, X \in S_i),
\]

and the corresponding probability of error is

\[
\sum_i P(\Theta \ne \theta_i,\, X \in S_i).
\]


The following is a typical example of MAP rule calculations for the case of 
two hypotheses. 


Example 8.9. We have two biased coins, referred to as coins 1 and 2, with
probabilities of heads equal to p_1 and p_2, respectively. We choose a coin at random
(either coin is equally likely to be chosen) and we want to infer its identity, based
on the outcome of a single toss. Let Θ = 1 and Θ = 2 be the hypotheses that coin
1 or 2, respectively, was chosen. Let X be equal to 1 or 0, depending on whether
the outcome of the toss was a head or a tail, respectively.

Using the MAP rule, we compare p_Θ(1) p_{X|Θ}(x | 1) and p_Θ(2) p_{X|Θ}(x | 2), and
decide in favor of the coin for which the corresponding expression is largest. Since
p_Θ(1) = p_Θ(2) = 1/2, we only need to compare p_{X|Θ}(x | 1) and p_{X|Θ}(x | 2), and
select the hypothesis under which the observed toss outcome is most likely. Thus,
for example, if p_1 = 0.46, p_2 = 0.52, and the outcome was a tail, we notice that

\[
P(\text{tail} \mid \Theta = 1) = 1 - 0.46 > 1 - 0.52 = P(\text{tail} \mid \Theta = 2),
\]

and decide in favor of coin 1.

Suppose now that we toss the selected coin n times and let X be the number of
heads obtained. Then, the preceding argument is still valid and shows that,
according to the MAP rule, we should select the hypothesis under which the observed
outcome is most likely [this critically depends on the assumption p_Θ(1) = p_Θ(2) = 1/2].
Thus, if X = k, we should decide that Θ = 1 if

\[
p_1^k (1 - p_1)^{n-k} > p_2^k (1 - p_2)^{n-k},
\]

and decide that Θ = 2 otherwise. Figure 8.5 illustrates the MAP rule.


[Figure 8.5: posterior probabilities P(Θ = 1 | X = k) and P(Θ = 2 | X = k) plotted against the number of heads k, with the regions “Choose Θ = 1” and “Choose Θ = 2” marked.]

Figure 8.5. The MAP rule for Example 8.9, in the case where n = 50,
p_1 = 0.46, and p_2 = 0.52. It calculates the posterior probabilities

\[
P(\Theta = i \mid X = k) = c(k)\, p_\Theta(i)\, P(X = k \mid \Theta = i)
= c(k)\, p_\Theta(i) \binom{n}{k} p_i^k (1 - p_i)^{n-k}, \qquad i = 1, 2,
\]

where c(k) is a positive normalizing constant, and chooses the hypothesis
Θ = i that has the largest posterior probability. Because p_Θ(1) = p_Θ(2) =
1/2 in this example, the MAP rule chooses the hypothesis Θ = i for which
p_i^k (1 - p_i)^{n-k} is largest. The rule is to accept Θ = 1 if k ≤ k*, where k* = 24,
and to accept Θ = 2 otherwise.


The character of the MAP rule, as illustrated in Fig. 8.5, is typical of decision 
rules in binary hypothesis testing problems: it is specified by a partition of the 
observation space into the two disjoint sets in which each of the two hypotheses is 
chosen. In this example, the MAP rule is specified by a single threshold k": accept 
O = 1 if k < k" and accept © = 2 otherwise. The overall probability of error is 
obtained by using the total probability theorem: 


$$\begin{aligned}
P(\text{error}) &= P(\Theta = 1,\, X > k^*) + P(\Theta = 2,\, X \le k^*) \\
&= p_\Theta(1) \sum_{k=k^*+1}^{n} c(k)\, p_1^k (1 - p_1)^{n-k} + p_\Theta(2) \sum_{k=0}^{k^*} c(k)\, p_2^k (1 - p_2)^{n-k} \\
&= \frac{1}{2} \left( \sum_{k=k^*+1}^{n} c(k)\, p_1^k (1 - p_1)^{n-k} + \sum_{k=0}^{k^*} c(k)\, p_2^k (1 - p_2)^{n-k} \right),
\end{aligned}$$

where $c(k) = \binom{n}{k}$ is the binomial coefficient. Figure 8.6 gives the probability of
error for a threshold-type decision rule, as a function of the threshold $k^*$. The
MAP rule, which in the current example corresponds to $k^* = 24$, maximizes the
probability of a correct decision, and hence gives the minimal probability of error.





[Figure 8.6 plot: probability of error versus the threshold $k^*$, with the MAP rule threshold marked.]


Figure 8.6. A plot of the probability of error for a threshold-type decision
rule that accepts $\Theta = 1$ if $k \le k^*$ and accepts $\Theta = 2$ otherwise, as a function
of the threshold $k^*$ (cf. Example 8.9). The problem data here are $n = 50$,
$p_1 = 0.46$, and $p_2 = 0.52$, the same as in Fig. 8.5. The threshold of the MAP
rule is $k^* = 24$, and minimizes the probability of error.


The following is a classical example from communication engineering. 


Example 8.10. Signal Detection and the Matched Filter. A transmitter
sends one of two possible messages. Let $\Theta = 1$ or $\Theta = 2$, depending on whether the
first or the second message is transmitted. We assume that the two messages are
equally likely, i.e., the prior probabilities are $p_\Theta(1) = p_\Theta(2) = 1/2$.

In order to enhance the resiliency of the transmission with respect to noise,
the transmitter uses a signal that extends the transmitted message over time. In
particular, the transmitter sends a signal $S = (S_1, \ldots, S_n)$, where each $S_i$ is a
real number. If $\Theta = 1$ (respectively, $\Theta = 2$), then $S$ is a fixed sequence $(a_1, \ldots, a_n)$
[respectively, $(b_1, \ldots, b_n)$]. We assume that the two candidate signals have the same
"energy," i.e., $a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$. The receiver observes the transmitted
signal, but corrupted by additive noise. More specifically, it obtains the observations

$$X_i = S_i + W_i, \qquad i = 1, \ldots, n,$$

where we assume that the $W_i$ are standard normal random variables, independent
of each other and independent of the signal.

Under the hypothesis $\Theta = 1$, the $X_i$ are independent normal random variables,
with mean $a_i$ and unit variance. Thus,

$$f_{X|\Theta}(x \mid 1) = \frac{1}{(\sqrt{2\pi})^n} \exp\left\{ -\frac{(x_1 - a_1)^2 + \cdots + (x_n - a_n)^2}{2} \right\}.$$

Similarly,

$$f_{X|\Theta}(x \mid 2) = \frac{1}{(\sqrt{2\pi})^n} \exp\left\{ -\frac{(x_1 - b_1)^2 + \cdots + (x_n - b_n)^2}{2} \right\}.$$

From Bayes' rule, the probability that the first message was transmitted is

$$\frac{\exp\left\{ -\left( (x_1 - a_1)^2 + \cdots + (x_n - a_n)^2 \right)/2 \right\}}{\exp\left\{ -\left( (x_1 - a_1)^2 + \cdots + (x_n - a_n)^2 \right)/2 \right\} + \exp\left\{ -\left( (x_1 - b_1)^2 + \cdots + (x_n - b_n)^2 \right)/2 \right\}}.$$

After expanding the squared terms and using the assumption
$a_1^2 + \cdots + a_n^2 = b_1^2 + \cdots + b_n^2$, this expression simplifies to

$$P(\Theta = 1 \mid X = x) = p_{\Theta|X}(1 \mid x) = \frac{e^{a_1 x_1 + \cdots + a_n x_n}}{e^{a_1 x_1 + \cdots + a_n x_n} + e^{b_1 x_1 + \cdots + b_n x_n}}.$$

The formula for $P(\Theta = 2 \mid X = x)$ is similar, with the $a_i$ in the numerator replaced
by $b_i$.

According to the MAP rule, we should choose the hypothesis with maximum
posterior probability, which yields:

$$\text{select } \Theta = 1 \text{ if } \sum_{i=1}^{n} a_i x_i > \sum_{i=1}^{n} b_i x_i,$$

$$\text{select } \Theta = 2 \text{ if } \sum_{i=1}^{n} a_i x_i < \sum_{i=1}^{n} b_i x_i.$$

(If the inner products above are equal, either hypothesis may be selected.) This
particular structure for deciding which signal was transmitted is called a matched
filter: we "match" the received signal $(x_1, \ldots, x_n)$ with each of the two candidate
signals by forming the inner products $\sum_{i=1}^{n} a_i x_i$ and $\sum_{i=1}^{n} b_i x_i$; we then select the
hypothesis that yields the higher value (the "best match").

This example can be generalized to the case of $m > 2$ equally likely messages.
We assume that for message $k$, the transmitter sends a fixed signal $(a_1^k, \ldots, a_n^k)$,
where $(a_1^k)^2 + \cdots + (a_n^k)^2$ is the same for all $k$. Then, under the same noise model, a
similar calculation shows that the MAP rule decodes a received signal $(x_1, \ldots, x_n)$
as the message $k$ for which $\sum_{i=1}^{n} a_i^k x_i$ is largest.
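The matched filter rule can be sketched in a few lines. The code below is an illustrative sketch, not a reference implementation: the function names, the two candidate signals, and the received vector are all hypothetical, chosen so that the candidates have equal energy. It decodes by the largest inner product and, for comparison, computes the posterior probabilities directly from the normal likelihoods under equal priors.

```python
import math


def matched_filter_decode(candidates, received):
    """Index of the candidate signal with the largest inner product with `received`."""
    scores = [sum(a * x for a, x in zip(signal, received)) for signal in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


def posterior(candidates, received):
    """Posterior probabilities under equal priors and standard normal noise."""
    log_likelihoods = [
        -0.5 * sum((x - a) ** 2 for a, x in zip(signal, received))
        for signal in candidates
    ]
    top = max(log_likelihoods)  # subtract the maximum for numerical stability
    weights = [math.exp(v - top) for v in log_likelihoods]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical equal-energy signals: both have sum of squares equal to 3.
signals = [(1.0, -1.0, 1.0), (-1.0, 1.0, 1.0)]
received = (0.8, -0.3, 1.4)

decoded = matched_filter_decode(signals, received)
probabilities = posterior(signals, received)
print(decoded, probabilities)
```

Because the candidates have equal energy, the index with the largest inner product is also the index with the largest posterior probability, which is what the comparison illustrates.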


8.3 BAYESIAN LEAST MEAN SQUARES ESTIMATION


In this section, we discuss in more detail the conditional expectation estimator. 
In particular, we show that it results in the least possible mean squared error 
(hence the abbreviation LMS for least mean squares), and we explore some of 
its other properties. 

We start by considering the simpler problem of estimating $\Theta$ with a constant
$\hat\theta$, in the absence of an observation $X$. The estimation error $\hat\theta - \Theta$ is random
(because $\Theta$ is random), but the mean squared error $E[(\Theta - \hat\theta)^2]$ is a number
that depends on $\hat\theta$, and can be minimized over $\hat\theta$. With this criterion, it turns
out that the best possible estimate is to set $\hat\theta$ equal to $E[\Theta]$, as we proceed to
verify.

For any estimate $\hat\theta$, we have

$$E[(\Theta - \hat\theta)^2] = \operatorname{var}(\Theta - \hat\theta) + \left( E[\Theta - \hat\theta] \right)^2 = \operatorname{var}(\Theta) + \left( E[\Theta] - \hat\theta \right)^2;$$

the first equality uses the formula $E[Z^2] = \operatorname{var}(Z) + (E[Z])^2$, and the second
holds because when the constant $\hat\theta$ is subtracted from the random variable $\Theta$,
the variance is unaffected while the mean is reduced by $\hat\theta$. We now note that the
term $\operatorname{var}(\Theta)$ does not depend on $\hat\theta$. Therefore, we should choose $\hat\theta$ to minimize
the term $(E[\Theta] - \hat\theta)^2$, which leads to $\hat\theta = E[\Theta]$ (see Fig. 8.7).


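[Figure 8.7 plot: the mean squared estimation error $E[(\Theta - \hat\theta)^2] = \operatorname{var}(\Theta) + (E[\Theta] - \hat\theta)^2$ versus $\hat\theta$.]

Figure 8.7: The mean squared error $E[(\Theta - \hat\theta)^2]$, as a function of the estimate
$\hat\theta$, is a quadratic in $\hat\theta$, and is minimized when $\hat\theta = E[\Theta]$. The minimum value of
the mean squared error is $\operatorname{var}(\Theta)$.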
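The decomposition of the mean squared error into $\operatorname{var}(\Theta)$ plus a squared bias term can be verified exactly on a small discrete example. The sketch below assumes a hypothetical three-point PMF for $\Theta$ (it does not come from the text) and uses exact rational arithmetic so that the identity can be checked without rounding.

```python
from fractions import Fraction

# Hypothetical PMF for Theta, used only for illustration.
pmf = {1: Fraction(1, 2), 2: Fraction(1, 4), 5: Fraction(1, 4)}

mean = sum(value * prob for value, prob in pmf.items())
variance = sum((value - mean) ** 2 * prob for value, prob in pmf.items())


def mse(estimate):
    """Mean squared error E[(Theta - estimate)^2] for a constant estimate."""
    return sum((value - estimate) ** 2 * prob for value, prob in pmf.items())


# For every candidate estimate, the MSE equals var(Theta) + (E[Theta] - estimate)^2.
candidates = [0, 1, Fraction(5, 2), 4]
for estimate in candidates:
    assert mse(estimate) == variance + (mean - estimate) ** 2

print(mean, variance, mse(mean))
```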


Suppose now that we use an observation $X$ to estimate $\Theta$, so as to minimize
the mean squared error. Once we know the value $x$ of $X$, the situation is identical
to the one considered earlier, except that we are now in a new "universe,"
where everything is conditioned on $X = x$. We can therefore adapt our earlier
conclusion and assert that the conditional expectation $E[\Theta \mid X = x]$ minimizes
the conditional mean squared error $E[(\Theta - \hat\theta)^2 \mid X = x]$ over all constants $\hat\theta$.

Generally, the (unconditional) mean squared estimation error associated
with an estimator $g(X)$ is defined as

$$E\left[ (\Theta - g(X))^2 \right].$$

If we view $E[\Theta \mid X]$ as an estimator/function of $X$, the preceding analysis shows
that out of all possible estimators, the mean squared estimation error is minimized
when $g(X) = E[\Theta \mid X]$.†














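Key Facts About Least Mean Squares Estimation

• In the absence of any observations, $E[(\Theta - \hat\theta)^2]$ is minimized when
$\hat\theta = E[\Theta]$:

$$E\left[ (\Theta - E[\Theta])^2 \right] \le E\left[ (\Theta - \hat\theta)^2 \right], \qquad \text{for all } \hat\theta.$$

• For any given value $x$ of $X$, $E[(\Theta - \hat\theta)^2 \mid X = x]$ is minimized when
$\hat\theta = E[\Theta \mid X = x]$:

$$E\left[ (\Theta - E[\Theta \mid X = x])^2 \mid X = x \right] \le E\left[ (\Theta - \hat\theta)^2 \mid X = x \right], \qquad \text{for all } \hat\theta.$$

• Out of all estimators $g(X)$ of $\Theta$ based on $X$, the mean squared estimation
error $E[(\Theta - g(X))^2]$ is minimized when $g(X) = E[\Theta \mid X]$:

$$E\left[ (\Theta - E[\Theta \mid X])^2 \right] \le E\left[ (\Theta - g(X))^2 \right], \qquad \text{for all estimators } g(X).$$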


† For any given value $x$ of $X$, $g(x)$ is a number, and therefore,

$$E\left[ (\Theta - E[\Theta \mid X = x])^2 \mid X = x \right] \le E\left[ (\Theta - g(x))^2 \mid X = x \right].$$

Thus,

$$E\left[ (\Theta - E[\Theta \mid X])^2 \mid X \right] \le E\left[ (\Theta - g(X))^2 \mid X \right],$$

which is now an inequality between random variables (functions of $X$). We take
expectations of both sides, and use the law of iterated expectations, to conclude that

$$E\left[ (\Theta - E[\Theta \mid X])^2 \right] \le E\left[ (\Theta - g(X))^2 \right],$$

for all estimators $g(X)$.

Example 8.11. Let $\Theta$ be uniformly distributed over the interval $[4, 10]$ and suppose
that we observe $\Theta$ with some random error $W$. In particular, we observe the value
of the random variable

$$X = \Theta + W,$$

where we assume that $W$ is uniformly distributed over the interval $[-1, 1]$ and
independent of $\Theta$.

To calculate $E[\Theta \mid X = x]$, we note that $f_\Theta(\theta) = 1/6$, if $4 \le \theta \le 10$, and
$f_\Theta(\theta) = 0$, otherwise. Conditioned on $\Theta$ being equal to some $\theta$, $X$ is the same as
$\theta + W$, and is uniformly distributed over the interval $[\theta - 1, \theta + 1]$. Thus, the joint
PDF is given by

$$f_{\Theta,X}(\theta, x) = f_\Theta(\theta)\, f_{X|\Theta}(x \mid \theta) = \frac{1}{6} \cdot \frac{1}{2} = \frac{1}{12},$$

if $4 \le \theta \le 10$ and $\theta - 1 \le x \le \theta + 1$, and is zero for all other values of $(\theta, x)$. The
parallelogram in the right-hand side of Fig. 8.8 is the set of pairs $(\theta, x)$ for which
$f_{\Theta,X}(\theta, x)$ is nonzero.


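[Figure 8.8 plots: the prior PDF $f_\Theta(\theta)$; the observation model $X = \Theta + W$, with $W$ uniformly distributed in the interval $[-1, 1]$; and the least squares estimate $E[\Theta \mid X = x]$ drawn over the parallelogram, with the $x$-axis marked at 3, 5, 9, and 11.]

Figure 8.8: The PDFs in Example 8.11. The joint PDF of $\Theta$ and $X$ is
uniform over the parallelogram shown on the right. The LMS estimate of $\Theta$,
given the value $x$ of the random variable $X = \Theta + W$, depends on $x$ and is
represented by the piecewise linear function shown on the right.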


Given that $X = x$, the posterior PDF $f_{\Theta|X}$ is uniform on the corresponding
vertical section of the parallelogram. Thus $E[\Theta \mid X = x]$ is the midpoint of
that section, which in this example happens to be a piecewise linear function of
$x$. Conditioned on a particular value $x$ of $X$, the mean squared error, defined as
$E\left[ (\Theta - E[\Theta \mid X])^2 \mid X = x \right]$, is the conditional variance of $\Theta$. It is a function of $x$,
illustrated in Fig. 8.9.
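The midpoint rule of Example 8.11 translates directly into code. The sketch below is illustrative, with hypothetical function names: for an observed $x$ it finds the vertical section of the parallelogram, returns its midpoint as the LMS estimate, and returns the variance of a uniform distribution on that section (its squared length divided by 12) as the conditional mean squared error.

```python
def vertical_section(x):
    """Range of theta with nonzero joint PDF when X = x (Example 8.11)."""
    if not 3.0 <= x <= 11.0:
        raise ValueError("x must lie in [3, 11], the range of X")
    # theta must satisfy 4 <= theta <= 10 and x - 1 <= theta <= x + 1.
    return max(4.0, x - 1.0), min(10.0, x + 1.0)


def lms_estimate(x):
    """E[Theta | X = x]: the midpoint of the vertical section."""
    low, high = vertical_section(x)
    return (low + high) / 2.0


def conditional_mse(x):
    """Conditional variance of Theta given X = x (uniform on the section)."""
    low, high = vertical_section(x)
    return (high - low) ** 2 / 12.0


for x in (3.0, 4.0, 7.0, 10.0, 11.0):
    print(x, lms_estimate(x), conditional_mse(x))
```

At the extreme observations $x = 3$ and $x = 11$ the section shrinks to a single point, so the conditional mean squared error is zero.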


[Figure 8.9 plot: conditional mean squared estimation error versus the observed value $x$.]

Figure 8.9: The conditional mean squared error in Example 8.11, as a function
of the observed value $x$ of $X$. Note that certain values of the observation are more
favorable than others. For example, if $X = 3$, then we are certain that $\Theta = 4$,
and the conditional mean squared error is zero.

Example 8.12. Consider Example 8.7, in which Juliet is late on the first date by
a random amount $X$ that is uniformly distributed over the interval $[0, \Theta]$. Here,
$\Theta$ is an unknown random variable with a uniform prior PDF $f_\Theta$ over the interval
$[0, 1]$. In that example, we saw that the MAP estimate is equal to $x$ and that the
LMS estimate is

$$E[\Theta \mid X = x] = \int_x^1 \theta \cdot \frac{1}{\theta\, |\log x|}\, d\theta = \frac{1 - x}{|\log x|}.$$

Let us calculate the conditional mean squared error for the MAP and the
LMS estimates. Given $X = x$, for any estimate $\hat\theta$, we have

$$\begin{aligned}
E\left[ (\hat\theta - \Theta)^2 \mid X = x \right] &= \int_x^1 (\hat\theta - \theta)^2 \cdot \frac{1}{\theta\, |\log x|}\, d\theta \\
&= \int_x^1 (\hat\theta^2 - 2\hat\theta\theta + \theta^2) \cdot \frac{1}{\theta\, |\log x|}\, d\theta \\
&= \hat\theta^2 - \hat\theta\, \frac{2(1 - x)}{|\log x|} + \frac{1 - x^2}{2\, |\log x|}.
\end{aligned}$$

For the MAP estimate, $\hat\theta = x$, the conditional mean squared error is

$$E\left[ (\hat\theta - \Theta)^2 \mid X = x \right] = x^2 + \frac{3x^2 - 4x + 1}{2\, |\log x|}.$$

For the LMS estimate, $\hat\theta = (1 - x)/|\log x|$, the conditional mean squared error is

$$E\left[ (\hat\theta - \Theta)^2 \mid X = x \right] = \frac{1 - x^2}{2\, |\log x|} - \left( \frac{1 - x}{\log x} \right)^2.$$

The conditional mean squared errors of the two estimates are plotted in
Fig. 8.10, as functions of $x$, and it can be seen that the LMS estimator has uniformly
smaller mean squared error. This is a manifestation of the general optimality
property of the LMS estimator that we have established.

Figure 8.10. MAP and LMS estimates, and their conditional mean squared
errors as functions of the observation $x$ in Example 8.12.
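The comparison between the MAP and LMS estimates in Example 8.12 can be reproduced numerically. The following sketch evaluates the conditional mean squared error formula for both estimates on a grid of observations in $(0, 1)$. The grid and the function names are illustrative assumptions. The gap between the two errors should equal the squared distance between the two estimates, which is a general property of a quadratic in $\hat\theta$ minimized at the conditional mean.

```python
import math


def conditional_mse(estimate, x):
    """E[(estimate - Theta)^2 | X = x] for the model of Example 8.12, 0 < x < 1."""
    abs_log = abs(math.log(x))
    return estimate**2 - estimate * 2.0 * (1.0 - x) / abs_log + (1.0 - x * x) / (2.0 * abs_log)


def map_estimate(x):
    return x


def lms_estimate(x):
    return (1.0 - x) / abs(math.log(x))


grid = [i / 100.0 for i in range(1, 100)]  # x = 0.01, 0.02, ..., 0.99
gaps = [
    conditional_mse(map_estimate(x), x) - conditional_mse(lms_estimate(x), x)
    for x in grid
]
print(min(gaps), max(gaps))
```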


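Example 8.13. We consider the model in Example 8.8, where we observe the
number $X$ of heads in $n$ independent tosses of a biased coin. We assume that the
prior distribution of $\Theta$, the probability of heads, is uniform over $[0, 1]$. In that
example, we saw that when $X = k$, the posterior density is beta with parameters
$\alpha = k + 1$ and $\beta = n - k + 1$, and that the MAP estimate is equal to $k/n$. By using
the formula for the moments of the beta density (cf. Example 8.4), we have

$$E[\Theta^m \mid X = k] = \frac{(k + 1)(k + 2) \cdots (k + m)}{(n + 2)(n + 3) \cdots (n + m + 1)},$$

and in particular, the LMS estimate is

$$E[\Theta \mid X = k] = \frac{k + 1}{n + 2}.$$

Given $X = k$, the conditional mean squared error for any estimate $\hat\theta$ is

$$\begin{aligned}
E\left[ (\hat\theta - \Theta)^2 \mid X = k \right] &= \hat\theta^2 - 2\hat\theta\, E[\Theta \mid X = k] + E[\Theta^2 \mid X = k] \\
&= \hat\theta^2 - 2\hat\theta\, \frac{k + 1}{n + 2} + \frac{(k + 1)(k + 2)}{(n + 2)(n + 3)}.
\end{aligned}$$

The conditional mean squared error of the MAP estimate is

$$\begin{aligned}
E\left[ (\hat\theta - \Theta)^2 \mid X = k \right] &= E\left[ \left( \frac{k}{n} - \Theta \right)^2 \,\Big|\, X = k \right] \\
&= \left( \frac{k}{n} \right)^2 - 2\, \frac{k}{n} \cdot \frac{k + 1}{n + 2} + \frac{(k + 1)(k + 2)}{(n + 2)(n + 3)}.
\end{aligned}$$

The conditional mean squared error of the LMS estimate is

$$\begin{aligned}
E\left[ (\hat\theta - \Theta)^2 \mid X = k \right] &= E[\Theta^2 \mid X = k] - \left( E[\Theta \mid X = k] \right)^2 \\
&= \frac{(k + 1)(k + 2)}{(n + 2)(n + 3)} - \left( \frac{k + 1}{n + 2} \right)^2.
\end{aligned}$$

The results are plotted in Fig. 8.11 for the case of $n = 15$ tosses. Note that, as in
the preceding example, the LMS estimate has a uniformly smaller conditional mean
squared error.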
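The formulas of Example 8.13 can be tabulated exactly. The sketch below uses rational arithmetic and the posterior moment formula to compute the conditional mean squared errors of the MAP estimate $k/n$ and the LMS estimate $(k+1)/(n+2)$ for $n = 15$. The function names and the `rows` table are illustrative choices.

```python
from fractions import Fraction


def posterior_moment(k, n, m):
    """E[Theta^m | X = k] for a uniform prior and k heads in n tosses."""
    numerator = denominator = 1
    for j in range(1, m + 1):
        numerator *= k + j        # (k + 1)(k + 2)...(k + m)
        denominator *= n + 1 + j  # (n + 2)(n + 3)...(n + m + 1)
    return Fraction(numerator, denominator)


def conditional_mse(estimate, k, n):
    """E[(estimate - Theta)^2 | X = k], expanded in the first two posterior moments."""
    return estimate**2 - 2 * estimate * posterior_moment(k, n, 1) + posterior_moment(k, n, 2)


n = 15
rows = []
for k in range(n + 1):
    map_estimate = Fraction(k, n)
    lms_estimate = posterior_moment(k, n, 1)
    rows.append((k, conditional_mse(map_estimate, k, n), conditional_mse(lms_estimate, k, n)))

for k, mse_map, mse_lms in rows:
    print(k, float(mse_map), float(mse_lms))
```

For odd $n$ the two estimates never coincide (they agree only when $k = n/2$), so with $n = 15$ the LMS error is strictly smaller for every $k$.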


Figure 8.11. MAP and LMS estimates, and corresponding conditional mean
squared errors as functions of the observed number of heads $k$ in $n = 15$ tosses
(cf. Example 8.13).


Some Properties of the Estimation Error 


Let us use the notation

$$\hat\Theta = E[\Theta \mid X], \qquad \tilde\Theta = \hat\Theta - \Theta,$$

for the LMS estimator and the associated estimation error, respectively. The
random variables $\hat\Theta$ and $\tilde\Theta$ have a number of useful properties, which were derived
in Section 4.3 and for easy reference are repeated below. (Note the change
in notation: while in Section 4.3, the observation was denoted by $Y$ and the
estimated parameter was denoted by $X$, here they are denoted by $X$ and $\Theta$,
respectively.)

Properties of the Estimation Error

• The estimation error $\tilde\Theta$ is unbiased, i.e., it has zero unconditional
and conditional mean:

$$E[\tilde\Theta] = 0, \qquad E[\tilde\Theta \mid X = x] = 0, \quad \text{for all } x.$$

• The estimation error $\tilde\Theta$ is uncorrelated with the estimate $\hat\Theta$:

$$\operatorname{cov}(\hat\Theta, \tilde\Theta) = 0.$$

• The variance of $\Theta$ can be decomposed as

$$\operatorname{var}(\Theta) = \operatorname{var}(\hat\Theta) + \operatorname{var}(\tilde\Theta).$$





Example 8.14. Let us say that the observation $X$ is uninformative if the mean
squared estimation error $E[\tilde\Theta^2] = \operatorname{var}(\tilde\Theta)$ is the same as $\operatorname{var}(\Theta)$, the unconditional
variance of $\Theta$. When is this the case?

Using the formula

$$\operatorname{var}(\Theta) = \operatorname{var}(\hat\Theta) + \operatorname{var}(\tilde\Theta),$$

we see that $X$ is uninformative if and only if $\operatorname{var}(\hat\Theta) = 0$. The variance of a random
variable is zero if and only if that random variable is a constant, equal to its mean.
We conclude that $X$ is uninformative if and only if the estimate $\hat\Theta = E[\Theta \mid X]$ is
equal to $E[\Theta]$, for every value of $X$.

If $\Theta$ and $X$ are independent, we have $E[\Theta \mid X = x] = E[\Theta]$ for all $x$, and $X$ is
indeed uninformative, which is quite intuitive. The converse, however, is not true:
it is possible for $E[\Theta \mid X = x]$ to be always equal to the constant $E[\Theta]$, without $\Theta$
and $X$ being independent. (Can you construct an example?)


The Case of Multiple Observations and Multiple Parameters 


The preceding discussion was phrased as if $X$ were a single random variable.
However, the entire argument and its conclusions apply even if $X$ is a vector of
random variables, $X = (X_1, \ldots, X_n)$. Thus, the mean squared estimation error
is minimized if we use $E[\Theta \mid X_1, \ldots, X_n]$ as our estimator, i.e.,

$$E\left[ (\Theta - E[\Theta \mid X_1, \ldots, X_n])^2 \right] \le E\left[ (\Theta - g(X_1, \ldots, X_n))^2 \right],$$

for all estimators $g(X_1, \ldots, X_n)$.

This provides a complete solution to the general problem of LMS estimation,
but is often difficult to implement, for the following reasons:


(a) In order to compute the conditional expectation $E[\Theta \mid X_1, \ldots, X_n]$, we need
a complete probabilistic model, that is, the joint PDF $f_{\Theta, X_1, \ldots, X_n}$.

(b) Even if this joint PDF is available, $E[\Theta \mid X_1, \ldots, X_n]$ can be a very complicated
function of $X_1, \ldots, X_n$.

As a consequence, practitioners often resort to approximations of the conditional
expectation or focus on estimators that are not optimal but are simple and easy to
implement. The most common approach, discussed in the next section, involves
a restriction to linear estimators.

Finally, let us consider the case where we want to estimate multiple parameters
$\Theta_1, \ldots, \Theta_m$. It is then natural to consider the criterion

$$E\left[ (\Theta_1 - \hat\Theta_1)^2 \right] + \cdots + E\left[ (\Theta_m - \hat\Theta_m)^2 \right],$$

and minimize it over all estimators $\hat\Theta_1, \ldots, \hat\Theta_m$. But this is equivalent to finding,
for each $i$, an estimator $\hat\Theta_i$ that minimizes $E[(\Theta_i - \hat\Theta_i)^2]$, so that we are
essentially dealing with $m$ decoupled estimation problems, one for each unknown
parameter $\Theta_i$, yielding $\hat\Theta_i = E[\Theta_i \mid X_1, \ldots, X_n]$, for all $i$.


8.4 BAYESIAN LINEAR LEAST MEAN SQUARES ESTIMATION


In this section, we derive an estimator that minimizes the mean squared error 
within a restricted class of estimators: those that are linear functions of the 
observations. While this estimator may result in higher mean squared error, it 
has a significant practical advantage: it requires simple calculations, involving 
only means, variances, and covariances of the parameters and observations. It is 
thus a useful alternative to the conditional expectation/LMS estimator in cases 
where the latter is hard to compute. 
A linear estimator of a random variable $\Theta$, based on observations $X_1, \ldots, X_n$,
has the form

$$\hat\Theta = a_1 X_1 + \cdots + a_n X_n + b.$$

Given a particular choice of the scalars $a_1, \ldots, a_n, b$, the corresponding mean
squared error is

$$E\left[ (\Theta - a_1 X_1 - \cdots - a_n X_n - b)^2 \right].$$

The linear LMS estimator chooses $a_1, \ldots, a_n, b$ to minimize the above expression.
We first develop the solution for the case where $n = 1$, and then generalize.

Linear Least Mean Squares Estimation Based on a Single Observation

We are interested in finding $a$ and $b$ that minimize the mean squared estimation
error $E[(\Theta - aX - b)^2]$ associated with a linear estimator $aX + b$ of $\Theta$. Suppose
that $a$ has already been chosen. How should we choose $b$? This is the same as
choosing a constant $b$ to estimate the random variable $\Theta - aX$. By the discussion
in the beginning of Section 8.3, the best choice is

$$b = E[\Theta - aX] = E[\Theta] - a E[X].$$

With this choice of $b$, it remains to minimize, with respect to $a$, the expression

$$E\left[ (\Theta - aX - E[\Theta] + a E[X])^2 \right].$$

We write this expression as

$$\operatorname{var}(\Theta - aX) = \sigma_\Theta^2 + a^2 \sigma_X^2 + 2 \operatorname{cov}(\Theta, -aX) = \sigma_\Theta^2 + a^2 \sigma_X^2 - 2a \cdot \operatorname{cov}(\Theta, X),$$

where $\sigma_\Theta$ and $\sigma_X$ are the standard deviations of $\Theta$ and $X$, respectively, and

$$\operatorname{cov}(\Theta, X) = E\left[ (\Theta - E[\Theta])(X - E[X]) \right]$$

is the covariance of $\Theta$ and $X$. To minimize $\operatorname{var}(\Theta - aX)$ (a quadratic function
of $a$), we set its derivative to zero and solve for $a$. This yields

$$a = \frac{\operatorname{cov}(\Theta, X)}{\sigma_X^2} = \frac{\rho\, \sigma_\Theta \sigma_X}{\sigma_X^2} = \rho\, \frac{\sigma_\Theta}{\sigma_X},$$

where

$$\rho = \frac{\operatorname{cov}(\Theta, X)}{\sigma_\Theta \sigma_X}$$

is the correlation coefficient. With this choice of $a$, the mean squared estimation
error of the resulting linear estimator $\hat\Theta$ is given by

$$\begin{aligned}
\operatorname{var}(\Theta - \hat\Theta) &= \sigma_\Theta^2 + a^2 \sigma_X^2 - 2a \cdot \operatorname{cov}(\Theta, X) \\
&= \sigma_\Theta^2 + \rho^2\, \frac{\sigma_\Theta^2}{\sigma_X^2}\, \sigma_X^2 - 2 \rho\, \frac{\sigma_\Theta}{\sigma_X}\, \rho\, \sigma_\Theta \sigma_X \\
&= (1 - \rho^2)\, \sigma_\Theta^2.
\end{aligned}$$


Linear LMS Estimation Formulas

• The linear LMS estimator $\hat\Theta$ of $\Theta$ based on $X$ is

$$\hat\Theta = E[\Theta] + \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)} \left( X - E[X] \right) = E[\Theta] + \rho\, \frac{\sigma_\Theta}{\sigma_X} \left( X - E[X] \right),$$

where

$$\rho = \frac{\operatorname{cov}(\Theta, X)}{\sigma_\Theta \sigma_X}$$

is the correlation coefficient.

• The resulting mean squared estimation error is equal to

$$(1 - \rho^2)\, \sigma_\Theta^2.$$

The formula for the linear LMS estimator only involves the means, variances,
and covariance of $\Theta$ and $X$. Furthermore, it has an intuitive interpretation.
Suppose, for concreteness, that the correlation coefficient $\rho$ is positive. The
estimator starts with the baseline estimate $E[\Theta]$ for $\Theta$, which it then adjusts by
taking into account the value of $X - E[X]$. For example, when $X$ is larger than
its mean, the positive correlation between $X$ and $\Theta$ suggests that $\Theta$ is expected
to be larger than its mean. Accordingly, the resulting estimate is set to a value
larger than $E[\Theta]$. The value of $\rho$ also affects the quality of the estimate. When
$|\rho|$ is close to 1, the two random variables are highly correlated, and knowing $X$
allows us to accurately estimate $\Theta$, resulting in a small mean squared error.

We finally note that the properties of the estimation error presented in
Section 8.3 can be shown to hold when $\hat\Theta$ is the linear LMS estimator; see the
end-of-chapter problems.


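Example 8.15. We revisit the model in Examples 8.2, 8.7, and 8.12, in which
Juliet is always late by an amount $X$ that is uniformly distributed over the interval
$[0, \Theta]$, and $\Theta$ is a random variable with a uniform prior PDF $f_\Theta(\theta)$ over the interval
$[0, 1]$. Let us derive the linear LMS estimator of $\Theta$ based on $X$.

Using the fact that $E[X \mid \Theta] = \Theta/2$ and the law of iterated expectations, the
expected value of $X$ is

$$E[X] = E\left[ E[X \mid \Theta] \right] = E\left[ \frac{\Theta}{2} \right] = \frac{E[\Theta]}{2} = \frac{1}{4}.$$

Furthermore, using the law of total variance (this is the same calculation as in
Example 4.17 of Chapter 4), we have

$$\operatorname{var}(X) = E\left[ \operatorname{var}(X \mid \Theta) \right] + \operatorname{var}\left( E[X \mid \Theta] \right) = E\left[ \frac{\Theta^2}{12} \right] + \operatorname{var}\left( \frac{\Theta}{2} \right) = \frac{1}{36} + \frac{1}{48} = \frac{7}{144}.$$

We now find the covariance of $X$ and $\Theta$, using the formula

$$\operatorname{cov}(\Theta, X) = E[\Theta X] - E[\Theta]\, E[X],$$

and the fact

$$E[\Theta^2] = \operatorname{var}(\Theta) + (E[\Theta])^2 = \frac{1}{12} + \frac{1}{4} = \frac{1}{3}.$$

We have

$$E[\Theta X] = E\left[ E[\Theta X \mid \Theta] \right] = E\left[ \Theta\, E[X \mid \Theta] \right] = E\left[ \frac{\Theta^2}{2} \right] = \frac{1}{6},$$

where the first equality follows from the law of iterated expectations, and the second
equality holds since for all $\theta$,

$$E[\Theta X \mid \Theta = \theta] = E[\theta X \mid \Theta = \theta] = \theta\, E[X \mid \Theta = \theta].$$

Thus,

$$\operatorname{cov}(\Theta, X) = E[\Theta X] - E[\Theta]\, E[X] = \frac{1}{6} - \frac{1}{2} \cdot \frac{1}{4} = \frac{1}{24}.$$

The linear LMS estimator is

$$\hat\Theta = E[\Theta] + \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)} \left( X - E[X] \right) = \frac{1}{2} + \frac{1/24}{7/144} \left( X - \frac{1}{4} \right) = \frac{6}{7} X + \frac{2}{7}.$$

The corresponding conditional mean squared error is calculated using the
formula derived in Example 8.12,

$$E\left[ (\hat\theta - \Theta)^2 \mid X = x \right] = \hat\theta^2 - \hat\theta\, \frac{2(1 - x)}{|\log x|} + \frac{1 - x^2}{2\, |\log x|},$$

and substituting the expression just derived, $\hat\theta = (6/7)x + (2/7)$. In Fig. 8.12, we
compare the linear LMS estimator with the MAP estimator and the LMS estimator
(cf. Examples 8.2, 8.7, and 8.12). Note that the LMS and linear LMS estimators
are nearly identical for much of the region of interest, and so are the corresponding
conditional mean squared errors. The MAP estimator has significantly larger mean
squared error than the other two estimators. For $x$ close to 1, the linear LMS
estimator performs worse than the other two estimators, and indeed may give an
estimate $\hat\theta > 1$, which is outside the range of possible values of $\Theta$.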
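The linear LMS formulas need only four moments, which makes them easy to package as a small helper. The sketch below is illustrative (the function names are assumptions, not from the text). It computes the coefficients $a$ and $b$ of the estimator $aX + b$ and the resulting mean squared error, written as $\operatorname{var}(\Theta) - \operatorname{cov}(\Theta, X)^2 / \operatorname{var}(X)$, which equals $(1 - \rho^2)\sigma_\Theta^2$. It is then applied to the moments computed in Example 8.15, using exact fractions.

```python
from fractions import Fraction


def linear_lms(mean_theta, mean_x, var_x, cov_theta_x):
    """Coefficients (a, b) of the linear LMS estimator a * X + b."""
    a = cov_theta_x / var_x
    b = mean_theta - a * mean_x
    return a, b


def linear_lms_mse(var_theta, var_x, cov_theta_x):
    """Mean squared error (1 - rho^2) * var(Theta), written without square roots."""
    return var_theta - cov_theta_x**2 / var_x


# Moments from Example 8.15: E[Theta] = 1/2, E[X] = 1/4,
# var(Theta) = 1/12, var(X) = 7/144, cov(Theta, X) = 1/24.
a, b = linear_lms(Fraction(1, 2), Fraction(1, 4), Fraction(7, 144), Fraction(1, 24))
mse = linear_lms_mse(Fraction(1, 12), Fraction(7, 144), Fraction(1, 24))
print(a, b, mse)
```

Evaluating $a x + b$ at $x = 1$ gives a value above 1, which illustrates the remark that the linear estimate can leave the range of $\Theta$.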


[Figure 8.12 plots: curves labeled "MAP Estimates", "LMS Estimates", and "Linear LMS Estimates".]

Figure 8.12. Three estimators and their mean squared errors, as functions of
the observed value $x$, for the problem in Example 8.15.

Example 8.16. Linear LMS Estimation of the Bias of a Coin. We revisit
the coin tossing problem of Examples 8.4, 8.8, and 8.13, and derive the linear
LMS estimator. Here, the probability of heads of the coin is modeled as a random
variable $\Theta$ whose prior distribution is uniform over the interval $[0, 1]$. The coin is
tossed $n$ times, independently, resulting in a random number of heads, denoted by
$X$. Thus, if $\Theta$ is equal to $\theta$, the random variable $X$ has a binomial distribution
with parameters $n$ and $\theta$.

We calculate the various coefficients that appear in the formula for the linear
LMS estimator. We have $E[\Theta] = 1/2$, and

$$E[X] = E\left[ E[X \mid \Theta] \right] = E[n\Theta] = \frac{n}{2}.$$

The variance of $\Theta$ is $1/12$, so that $\sigma_\Theta = 1/\sqrt{12}$. Also, as calculated in the previous
example, $E[\Theta^2] = 1/3$. If $\Theta$ takes the value $\theta$, the (conditional) variance of $X$ is
$n\theta(1 - \theta)$. Using the law of total variance, we obtain

$$\begin{aligned}
\operatorname{var}(X) &= E\left[ \operatorname{var}(X \mid \Theta) \right] + \operatorname{var}\left( E[X \mid \Theta] \right) \\
&= E\left[ n\Theta(1 - \Theta) \right] + \operatorname{var}(n\Theta) \\
&= \frac{n}{2} - \frac{n}{3} + \frac{n^2}{12} \\
&= \frac{n(n + 2)}{12}.
\end{aligned}$$

In order to find the covariance of $X$ and $\Theta$, we use the formula

$$\operatorname{cov}(\Theta, X) = E[\Theta X] - E[\Theta]\, E[X] = E[\Theta X] - \frac{n}{4}.$$

Similar to Example 8.15, we have

$$E[\Theta X] = E\left[ E[\Theta X \mid \Theta] \right] = E\left[ \Theta\, E[X \mid \Theta] \right] = E[n\Theta^2] = \frac{n}{3},$$

so that

$$\operatorname{cov}(\Theta, X) = \frac{n}{3} - \frac{n}{4} = \frac{n}{12}.$$

Putting everything together, we conclude that the linear LMS estimator takes the
form

$$\hat\Theta = \frac{1}{2} + \frac{n/12}{n(n + 2)/12} \left( X - \frac{n}{2} \right) = \frac{1}{2} + \frac{1}{n + 2} \left( X - \frac{n}{2} \right) = \frac{X + 1}{n + 2}.$$

Notice that this agrees with the LMS estimator that we derived in Example 8.13.
This should not be a surprise: if the LMS estimator turns out to be linear, as was
the case in Example 8.13, then that estimator is also optimal within the smaller
class of linear estimators.

The Case of Multiple Observations and Multiple Parameters


The linear LMS methodology extends to the case of multiple observations. There
is an analogous formula for the linear LMS estimator, derived in a similar manner.
It involves only means, variances, and covariances between various pairs of
random variables. Also, if there are multiple parameters $\Theta_i$ to be estimated, we
may consider the criterion

$$E\left[ (\Theta_1 - \hat\Theta_1)^2 \right] + \cdots + E\left[ (\Theta_m - \hat\Theta_m)^2 \right],$$

and minimize it over all estimators $\hat\Theta_1, \ldots, \hat\Theta_m$ that are linear functions of the
observations. This is equivalent to finding, for each $i$, a linear estimator $\hat\Theta_i$ that
minimizes $E[(\Theta_i - \hat\Theta_i)^2]$, so that we are essentially dealing with $m$ decoupled
linear estimation problems, one for each unknown parameter.

In the case where there are multiple observations with a certain independence
property, the formula for the linear LMS estimator simplifies as we will
now describe. Let $\Theta$ be a random variable with mean $\mu$ and variance $\sigma_0^2$, and let
$X_1, \ldots, X_n$ be observations of the form

$$X_i = \Theta + W_i,$$

where the $W_i$ are random variables with mean 0 and variance $\sigma_i^2$, which represent
observation errors. Under the assumption that the random variables
$\Theta, W_1, \ldots, W_n$ are uncorrelated, the linear LMS estimator of $\Theta$, based on the
observations $X_1, \ldots, X_n$, turns out to be

$$\hat\Theta = \frac{\mu/\sigma_0^2 + \sum_{i=1}^{n} X_i/\sigma_i^2}{\sum_{i=0}^{n} 1/\sigma_i^2}.$$

The derivation involves forming the function

$$h(a_1, \ldots, a_n, b) = E\left[ (\Theta - a_1 X_1 - \cdots - a_n X_n - b)^2 \right],$$

and minimizing it by setting to zero its partial derivatives with respect to
$a_1, \ldots, a_n, b$. After some calculation (given in the end-of-chapter problems), this
results in

$$b = \frac{\mu/\sigma_0^2}{\sum_{i=0}^{n} 1/\sigma_i^2}, \qquad a_j = \frac{1/\sigma_j^2}{\sum_{i=0}^{n} 1/\sigma_i^2}, \quad j = 1, \ldots, n,$$

from which the formula for the linear LMS estimator given earlier follows.
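The multiple-observation formula is a precision-weighted average of the prior mean and the observations, which can be sketched as follows. The function name and argument names are illustrative assumptions. The inputs are the prior mean $\mu$, the prior variance $\sigma_0^2$, the observed values, and the noise variances $\sigma_1^2, \ldots, \sigma_n^2$.

```python
def linear_lms_multi(mu, prior_var, observations, noise_vars):
    """Linear LMS estimate of Theta from X_i = Theta + W_i, with Theta and the W_i uncorrelated."""
    if len(observations) != len(noise_vars):
        raise ValueError("need one noise variance per observation")
    # Numerator: mu / sigma_0^2 plus the sum of X_i / sigma_i^2.
    weighted_sum = mu / prior_var + sum(x / v for x, v in zip(observations, noise_vars))
    # Denominator: sum over i = 0, ..., n of 1 / sigma_i^2.
    total_weight = 1.0 / prior_var + sum(1.0 / v for v in noise_vars)
    return weighted_sum / total_weight


# Hypothetical numbers: prior mean 1, prior variance 4, one observation X = 3 with noise variance 1.
estimate = linear_lms_multi(1.0, 4.0, [3.0], [1.0])
print(estimate)
```

With a single observation this agrees with the single-observation formula $E[\Theta] + \frac{\operatorname{cov}(\Theta, X)}{\operatorname{var}(X)}(X - E[X])$, since here $\operatorname{cov}(\Theta, X) = \sigma_0^2$ and $\operatorname{var}(X) = \sigma_0^2 + \sigma_1^2$.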
Linear Estimation and Normal Models


The linear LMS estimator is generally different from and, therefore, inferior to
the LMS estimator $E[\Theta \mid X_1, \ldots, X_n]$. However, if the LMS estimator happens to
be linear in the observations $X_1, \ldots, X_n$, then it is also the linear LMS estimator,
i.e., the two estimators coincide.

An important example where this occurs is the estimation of a normal random
variable $\Theta$ on the basis of observations $X_i = \Theta + W_i$, where the $W_i$ are
independent zero mean normal noise terms, independent of $\Theta$. This is the same
model as in Example 8.3, where we saw that the posterior distribution of $\Theta$ is
normal and that the conditional mean is a linear function of the observations.
Thus, the LMS and the linear LMS estimators coincide. Indeed, the formula for
the linear LMS estimator given in this section is consistent with the expression
for the posterior mean $\hat\Theta$ in Example 8.3 (the notation $\mu$ here corresponds to $x_0$
in Example 8.3). This is a manifestation of a property that can be shown to
hold more generally: if $\Theta, X_1, \ldots, X_n$ are all linear functions of a collection of
independent normal random variables, then the LMS and the linear LMS estimators
coincide. They also coincide with the MAP estimator, since the normal
distribution is symmetric and unimodal.

The above discussion leads to an interesting interpretation of linear LMS
estimation: the estimator is the same as the one that would have been obtained
if we were to pretend that the random variables involved were normal, with
the given means, variances, and covariances. Thus, there are two alternative
perspectives on linear LMS estimation: either as a computational shortcut (avoid
the evaluation of a possibly complicated formula for $E[\Theta \mid X]$), or as a model
simplification (replace less tractable distributions by normal ones).


The Choice of Variables in Linear Estimation 


Let us point out an important difference between LMS and linear LMS estimation.
Consider an unknown random variable $\Theta$, observations $X_1, \ldots, X_n$, and
transformed observations $Y_i = h(X_i)$, $i = 1, \ldots, n$, where the function $h$ is
one-to-one. The transformed observations $Y_i$ convey the same information as the
original observations $X_i$, and therefore the LMS estimator based on $Y_1, \ldots, Y_n$
is the same as the one based on $X_1, \ldots, X_n$:

$$E\left[ \Theta \mid h(X_1), \ldots, h(X_n) \right] = E[\Theta \mid X_1, \ldots, X_n].$$

On the other hand, linear LMS estimation is based on the premise that
the class of linear functions of the observations $X_1, \ldots, X_n$ contains reasonably
good estimators of $\Theta$; this may not always be the case. For example, suppose
that $\Theta$ is the unknown variance of some distribution and $X_1, \ldots, X_n$ represent
independent random variables drawn from that distribution. Then, it would be
unreasonable to expect that a good estimator of $\Theta$ can be obtained with a linear
function of $X_1, \ldots, X_n$. This suggests that it may be helpful to transform the
observations so that good estimators of $\Theta$ can be found within the class of linear
functions of the transformed observations. Suitable transformations may not
always be obvious, but intuition into the structure of the given problem may
suggest some good choices; see Problem 17 for a simple example.


8.5 SUMMARY AND DISCUSSION


We have introduced statistical inference methods that aim to extract information
about unknown variables or models from probabilistically related observations.
We have focused on the case where the unknown is a (possibly multidimensional)
parameter $\theta$, and we have discussed hypothesis testing and estimation problems.

We have drawn a distinction between the Bayesian and classical inference
approaches. In this chapter, we have discussed Bayesian methods, which treat
the parameter as a random variable $\Theta$ with known prior distribution. The key
object of interest here is the posterior distribution of $\Theta$ given the observations.
The posterior can in principle be calculated using Bayes' rule, although in practice,
this may be difficult.

The MAP rule, which maximizes the posterior over $\theta$, is a general inference
method that can address both estimation and hypothesis testing problems. We
discussed two other methods for parameter estimation: the LMS (or conditional
expectation) estimator and the linear LMS estimator, both of which are based
on minimization of the mean squared error between $\Theta$ and its estimate. The
latter estimator results in higher mean squared error, but requires simple calculations,
involving only means, variances, and covariances of the parameters and
observations. Under normality assumptions on the parameter and observations,
the MAP and the two LMS estimators coincide.

PROBLEMS





SECTION 8.1. Bayesian Inference and the Posterior Distribution 


Problem 1. Artemisia moves to a new house and she is "fifty-percent sure" that the
phone number is 2537267. To verify this, she uses the house phone to dial 2537267, she
obtains a busy signal, and concludes that this is indeed the correct number. Assuming
that the probability of a typical seven-digit phone number being busy at any given time
is 1%, what is the probability that Artemisia's conclusion was correct?


Problem 2. Nefeli, a student in a probability class, takes a multiple-choice test with
10 questions and 3 choices per question. For each question, there are two equally likely
possibilities, independent of other questions: either she knows the answer, in which
case she answers the question correctly, or else she guesses the answer with probability
of success 1/3.


(a) Given that Nefeli answered correctly the first question, what is the probability 
that she knew the answer to that question? 


(b) Given that Nefeli answered correctly 6 out of the 10 questions, what is the pos- 
terior PMF of the number of questions of which she knew the answer? 


SECTION 8.2. Point Estimation, Hypothesis Testing, and the MAP 
Rule 


Problem 3. The number of minutes between successive bus arrivals at Alvin's bus
stop is exponentially distributed with parameter $\Theta$, and Alvin's prior PDF of $\Theta$ is

$$f_\Theta(\theta) = \begin{cases} 10\theta, & \text{if } \theta \in [0, 1/\sqrt{5}], \\ 0, & \text{otherwise.} \end{cases}$$

(a) Alvin arrives on Monday at the bus stop and has to wait 30 minutes for the bus
to arrive. What is the posterior PDF, and the MAP and conditional expectation
estimates of $\Theta$?


(b) Following his Monday experience, Alvin decides to estimate $\Theta$ more accurately,
and records his waiting times for five days. These are 30, 25, 15, 40, and 20
minutes, and Alvin assumes that his observations are independent. What is the
posterior PDF, and the MAP and conditional expectation estimates of $\Theta$ given
the five-day data?


Problem 4. Students in a probability class take a multiple-choice test with 10
questions and 3 choices per question. A student who knows the answer to a question will
answer it correctly, while a student who does not will guess the answer with probability
of success 1/3. Each student is equally likely to belong to one of three categories
$i = 1, 2, 3$: those who know the answer to each question with corresponding probabilities
$\theta_i$, where $\theta_1 = 0.3$, $\theta_2 = 0.7$, and $\theta_3 = 0.95$ (independent of other questions). Suppose
that a randomly chosen student answers $k$ questions correctly.


(a) For each possible value of k, derive the MAP estimate of the category that this 
student belongs to. 


(b) Let M be the number of questions that the student knows how to answer. Derive 
the posterior PMF, and the MAP and LMS estimates of M given that the student 
answered correctly 5 questions. 


Problem 5. Consider a variation of the biased coin problem in Example 8.4, and 
assume the probability of heads, $\Theta$, is distributed over [0, 1] according to the PDF 

$$f_\Theta(\theta) = 2 - 4\left|\tfrac{1}{2} - \theta\right|, \qquad \theta \in [0, 1].$$

Find the MAP estimate of $\Theta$, assuming that $n$ independent coin tosses resulted in $k$ 
heads and $n - k$ tails. 


Problem 6. Professor May B. Hard, who has a tendency to give difficult problems 
in probability quizzes, is concerned about one of the problems she has prepared for an 
upcoming quiz. She therefore asks her TA to solve the problem and record the solution 
time. May's prior probability that the problem is difficult is 0.3, and she knows from 
experience that the conditional PDF of her TA's solution time X, in minutes, is 


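$$f_{X|\Theta}(x \mid \theta = 1) = \begin{cases} c_1 e^{-0.04x}, & \text{if } 5 \le x \le 60, \\ 0, & \text{otherwise,} \end{cases}$$

if $\Theta = 1$ (problem is difficult), and is 

$$f_{X|\Theta}(x \mid \theta = 2) = \begin{cases} c_2 e^{-0.16x}, & \text{if } 5 \le x \le 60, \\ 0, & \text{otherwise,} \end{cases}$$

if $\Theta = 2$ (problem is not difficult), where $c_1$ and $c_2$ are normalizing constants. She uses 
the MAP rule to decide whether the problem is difficult. 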


(a) Given that the TA’s solution time was 20 minutes, which hypothesis will she 
accept and what will be the probability of error? 


(b) Not satisfied with the reliability of her decision, May asks four more TAs to solve 
the problem. The TAs’ solution times are conditionally independent and iden- 
tically distributed with the solution time of the first TA. The recorded solution 
times are 10, 25, 15, and 35 minutes. On the basis of the five observations, which 
hypothesis will she now accept, and what will be the probability of error? 


Problem 7. We have two boxes, each containing three balls: one black and two white 
in box 1; two black and one white in box 2. We choose one of the boxes at random, 
where the probability of choosing box 1 is equal to some given p, and then draw a ball. 


(a) Describe the MAP rule for deciding the identity of the box based on whether the 
drawn ball is black or white. 

(b) Assuming that $p = 1/2$, find the probability of an incorrect decision and compare 
it with the probability of error if no ball had been drawn. 


Problem 8. The probability of heads of a given coin is known to be either $q_0$ 
(hypothesis $H_0$) or $q_1$ (hypothesis $H_1$). We toss the coin repeatedly and independently, 
and record the number of heads before a tail is observed for the first time. We assume 
that $0 < q_0 < q_1 < 1$, and that we are given prior probabilities $P(H_0)$ and $P(H_1)$. For 
parts (a) and (b), we also assume that $P(H_0) = P(H_1) = 1/2$. 


(a) Calculate the probability that hypothesis $H_1$ is true, given that there were exactly 
$k$ heads before the first tail. 


(b) Consider the decision rule that decides in favor of hypothesis $H_1$ if $k > k^*$, where 
$k^*$ is some nonnegative integer, and decides in favor of hypothesis $H_0$ otherwise. 
Give a formula for the probability of error in terms of $k^*$, $q_0$, and $q_1$. For what 
value of $k^*$ is the probability of error minimized? Is there another type of decision 
rule that would lead to an even lower probability of error? 


(c) Assume that $q_0 = 0.3$, $q_1 = 0.7$, and $P(H_1) > 0.7$. How does the optimal choice 
of $k^*$ (the one that minimizes the probability of error) change as $P(H_1)$ increases 
from 0.7 to 1.0? 


Problem 9.* Consider a Bayesian hypothesis testing problem involving $m$ hypotheses, 
and an observation vector $X = (X_1, \ldots, X_n)$. Let $g_n(X_1, \ldots, X_n)$ be the decision 
resulting from the MAP rule based on $X_1, \ldots, X_n$, and $g_{n-1}(X_1, \ldots, X_{n-1})$ the decision 
resulting from the MAP rule based on $X_1, \ldots, X_{n-1}$ (i.e., the MAP rule that uses 
only the first $n - 1$ components of the observation vector). Let $x = (x_1, \ldots, x_n)$ be the 
realized value of the observation vector, and let 

$$e_n(x_1, \ldots, x_n) = P\big(\Theta \ne g_n(x_1, \ldots, x_n) \mid X_1 = x_1, \ldots, X_n = x_n\big),$$
$$e_{n-1}(x_1, \ldots, x_{n-1}) = P\big(\Theta \ne g_{n-1}(x_1, \ldots, x_{n-1}) \mid X_1 = x_1, \ldots, X_{n-1} = x_{n-1}\big),$$

be the corresponding probabilities of error. Show that 

$$e_n(x_1, \ldots, x_n) \le e_{n-1}(x_1, \ldots, x_{n-1}),$$

so making the MAP decision with extra data cannot increase the probability of error. 


Solution. We view $g_{n-1}(X_1, \ldots, X_{n-1})$ as a special case of a decision rule based on all 
components $X_1, \ldots, X_n$ of the observation vector. Since the MAP rule $g_n(X_1, \ldots, X_n)$ 
minimizes the probability of error over all decision rules based on $X_1, \ldots, X_n$, the result 
follows. 


SECTION 8.3. Bayesian Least Mean Squares Estimation 


Problem 10. A police radar always overestimates the speed of incoming cars by 
an amount that is uniformly distributed between 0 and 5 miles/hour. Assume that 
car speeds are uniformly distributed between 55 and 75 miles/hour. What is the LMS 
estimate of a car's speed based on the radar's measurement? 


Problem 11. The number $\Theta$ of shopping carts in a store is uniformly distributed 
between 1 and 100. Carts are sequentially numbered between 1 and $\Theta$. You enter 
the store, observe the number $X$ on the first cart you encounter, assumed uniformly 
distributed over the range $1, \ldots, \Theta$, and use this information to estimate $\Theta$. Find and 
plot the MAP estimator and the LMS estimator. Hint: Note the resemblance with 
Example 8.2. 


Problem 12. Consider the multiple observation variant of Example 8.2: given that 
$\Theta = \theta$, the random variables $X_1, \ldots, X_n$ are independent and uniformly distributed 
on the interval $[0, \theta]$, and the prior distribution of $\Theta$ is uniform on the interval [0, 1]. 
Assume that $n > 3$. 


(a) Find the LMS estimate of $\Theta$, given the values $x_1, \ldots, x_n$ of $X_1, \ldots, X_n$. 


(b) Plot the conditional mean squared error of the MAP and LMS estimators, as 
functions of $\bar{x} = \max\{x_1, \ldots, x_n\}$, for the case $n = 5$. 


(c) If $\bar{x}$ is held fixed at $\bar{x} = 0.5$, how do the MAP and the LMS estimates, and the 
corresponding conditional mean squared errors, behave as $n \to \infty$? 


Problem 13.* 


(a) Let $Y_1, \ldots, Y_n$ be independent identically distributed random variables and let 
$Y = Y_1 + \cdots + Y_n$. Show that 

$$E[Y_1 \mid Y] = \frac{Y}{n}.$$


(b) Let $\Theta$ and $W$ be independent zero-mean normal random variables, with positive 
integer variances $k$ and $m$, respectively. Use the result of part (a) to find 
$E[\Theta \mid \Theta + W]$, and verify that this agrees with the conditional expectation formula 
in Example 8.3. Hint: Think of $\Theta$ and $W$ as sums of independent random 
variables. 


(c) Repeat part (b) for the case where $\Theta$ and $W$ are independent Poisson random 
variables with integer means $\lambda$ and $\mu$, respectively. 


Solution. (a) By symmetry, we see that $E[Y_i \mid Y]$ is the same for all $i$. Furthermore, 

$$E[Y_1 + \cdots + Y_n \mid Y] = E[Y \mid Y] = Y.$$

Therefore, $E[Y_1 \mid Y] = Y/n$. 


(b) We can think of $\Theta$ and $W$ as sums of independent standard normal random variables: 

$$\Theta = \Theta_1 + \cdots + \Theta_k, \qquad W = W_1 + \cdots + W_m.$$

We identify $Y$ with $\Theta + W$ and use the result from part (a), to obtain 

$$E[\Theta_i \mid \Theta + W] = \frac{\Theta + W}{k + m}.$$

Thus, 

$$E[\Theta \mid \Theta + W] = E[\Theta_1 + \cdots + \Theta_k \mid \Theta + W] = \frac{k}{k + m}(\Theta + W).$$

The formula for the conditional mean derived in Example 8.3, specialized to the 
current context (zero prior mean and a single measurement), shows that the conditional 
expectation is of the form 

$$\frac{(\Theta + W)/\sigma_1^2}{(1/\sigma_0^2) + (1/\sigma_1^2)} = \frac{(\Theta + W)/m}{(1/k) + (1/m)} = \frac{k}{k + m}(\Theta + W),$$

consistent with the answer obtained here. 


(c) We recall that the sum of independent Poisson random variables is Poisson. Thus 
the argument in part (b) goes through, by thinking of $\Theta$ and $W$ as sums of $\lambda$ (respectively, 
$\mu$) independent Poisson random variables with mean one. We then obtain 

$$E[\Theta \mid \Theta + W] = \frac{\lambda}{\lambda + \mu}(\Theta + W).$$
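
The Poisson result of part (c) can be checked numerically. The following Python sketch is an illustration added here, not part of the original solution: it computes the conditional expectation of Θ given Θ + W = s by enumerating the joint PMF, and compares it with λs/(λ + μ). The function names and the values λ = 3, μ = 2 are assumptions chosen only for the example.

```python
from math import exp, factorial


def poisson_pmf(k, mean):
    """PMF of a Poisson random variable with the given mean, evaluated at k."""
    return exp(-mean) * mean ** k / factorial(k)


def conditional_mean_theta(s, lam, mu):
    """E[Theta | Theta + W = s] for independent Poisson Theta (mean lam) and W (mean mu)."""
    # Joint probability of {Theta = t, W = s - t} for each feasible t.
    joint = [poisson_pmf(t, lam) * poisson_pmf(s - t, mu) for t in range(s + 1)]
    return sum(t * p for t, p in enumerate(joint)) / sum(joint)


lam, mu = 3, 2  # illustrative integer means (assumed values)
for s in range(6):
    by_enumeration = conditional_mean_theta(s, lam, mu)
    by_formula = lam / (lam + mu) * s
    print(s, round(by_enumeration, 6), round(by_formula, 6))
```

The two printed columns should agree up to floating-point rounding, for every value of s.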


SECTION 8.4. Bayesian Linear Least Mean Squares Estimation 


Problem 14. Consider the random variables $\Theta$ and $X$ in Example 8.11. Find the 
linear LMS estimator of $\Theta$ based on $X$, and the associated mean squared error. 


Problem 15. For the model in the shopping cart problem (Problem 11), derive and 
plot the conditional mean squared error, as a function of the number on the observed 
cart, for the MAP, LMS, and linear LMS estimators. 


Problem 16. The joint PDF of random variables $X$ and $\Theta$ is of the form 

$$f_{X,\Theta}(x, \theta) = \begin{cases} c, & \text{if } (x, \theta) \in S, \\ 0, & \text{otherwise,} \end{cases}$$

where $c$ is a constant and $S$ is the set 

$$S = \{(x, \theta) \mid 0 \le x \le 2,\ 0 \le \theta \le 2,\ x - 1 \le \theta \le x\}.$$

We want to estimate $\Theta$ based on $X$. 
(a) Find the LMS estimator $g(X)$ of $\Theta$. 
(b) Calculate $E[(\Theta - g(X))^2 \mid X = x]$, $E[g(X)]$, and $\mathrm{var}(g(X))$. 
(c) Calculate the mean squared error $E[(\Theta - g(X))^2]$. Is it the same as $E[\mathrm{var}(\Theta \mid X)]$? 
(d) Calculate $\mathrm{var}(\Theta)$ using the law of total variance. 
(e) Derive the linear LMS estimator of $\Theta$ based on $X$, and calculate its mean squared 
error. 


Problem 17. Let $\Theta$ be a positive random variable, with known mean $\mu$ and variance 
$\sigma^2$, to be estimated on the basis of a measurement $X$ of the form $X = \sqrt{\Theta}\, W$. We 
assume that $W$ is independent of $\Theta$ with zero mean, unit variance, and known fourth 
moment $E[W^4]$. Thus, the conditional mean and variance of $X$ given $\Theta$ are 0 and 
$\Theta$, respectively, so we are essentially trying to estimate the variance of $X$ given an 
observed value. Find the linear LMS estimator of $\Theta$ based on $X$, and the linear LMS 
estimator of $\Theta$ based on $X^2$. 


Problem 18. Swallowed Buffon's needle. A doctor is treating a patient who has 
accidentally swallowed a needle. The key factor in whether to operate on the patient 
is the length $\Theta$ of the needle, which is unknown, but is assumed to be uniformly 
distributed between 0 and $l > 0$. We wish to form an estimate of $\Theta$ based on $X$, its 
projected length in an X-ray. We introduce a two-dimensional coordinate system and 
write 

$$X = \Theta \cos W,$$

where $W$ is the acute angle formed by the needle and one of the axes. We assume that 
$W$ is uniformly distributed in the interval $[0, \pi/2]$, and is independent from $\Theta$. 


(a) Find the LMS estimator $E[\Theta \mid X]$. In particular, derive $F_{X|\Theta}(x \mid \theta)$, $f_{X|\Theta}(x \mid \theta)$, 
$f_X(x)$, $f_{\Theta|X}(\theta \mid x)$, and then compute $E[\Theta \mid X = x]$. Hint: You may find the 
following integration formulas useful: 

$$\int_a^b \frac{1}{\sqrt{\alpha^2 - c^2}}\, d\alpha = \log\left(\alpha + \sqrt{\alpha^2 - c^2}\right)\Big|_a^b, \qquad \int_a^b \frac{\alpha}{\sqrt{\alpha^2 - c^2}}\, d\alpha = \sqrt{\alpha^2 - c^2}\,\Big|_a^b.$$


(b) Find the linear LMS estimate of $\Theta$ based on $X$, and the associated mean squared 
error. 


Problem 19. Consider a photodetector in an optical communications system that 
counts the number of photons arriving during a certain interval. A user conveys information 
by switching a photon transmitter on or off. Assume that the probability of 
the transmitter being on is $p$. If the transmitter is on, the number of photons transmitted 
over the interval of interest is a Poisson random variable $\Theta$ with mean $\lambda$. If the 
transmitter is off, the number of photons transmitted is zero. 

Unfortunately, regardless of whether the transmitter is on or off, photons 
may still be detected due to a phenomenon called "shot noise." The number $N$ of 
detected shot noise photons is a Poisson random variable with mean $\mu$. Thus, the total 
number $X$ of detected photons is equal to $\Theta + N$ if the transmitter is on, and is equal 
to $N$ otherwise. We assume that $N$ and $\Theta$ are independent, so that $\Theta + N$ is also 
Poisson with mean $\lambda + \mu$. 


(a) What is the probability that the transmitter was on, given that the photodetector 
detected $k$ photons? 


(c) Describe the MAP rule for deciding whether the transmitter was on. 
(d) Find the linear LMS estimator of the number of transmitted photons, based on 
the number of detected photons. 


Problem 20.* Estimation with spherically invariant PDFs. Let $\Theta$ and $X$ be 
continuous random variables with joint PDF of the form 

$$f_{\Theta,X}(\theta, x) = h\big(q(\theta, x)\big),$$

where $h$ is a nonnegative scalar function, and $q(\theta, x)$ is a quadratic function of the form 

$$q(\theta, x) = a(\theta - \bar\theta)^2 + b(x - \bar x)^2 - 2c(\theta - \bar\theta)(x - \bar x).$$

Here $a$, $b$, $c$, $\bar\theta$, $\bar x$ are some scalars with $a \ne 0$. Derive the LMS and linear LMS estimates, 
for any $x$ such that $E[\Theta \mid X = x]$ is well-defined and finite. Assuming that $q(\theta, x) \ge 0$ 
for all $x$, $\theta$, and that $h$ is monotonically decreasing, derive the MAP estimate and show 
that it coincides with the LMS and linear LMS estimates. 


Solution. The posterior is given by 

$$f_{\Theta|X}(\theta \mid x) = \frac{f_{\Theta,X}(\theta, x)}{f_X(x)} = \frac{h\big(q(\theta, x)\big)}{f_X(x)}.$$

To motivate the derivation of the LMS and linear LMS estimates, consider first the 
MAP estimate, assuming that $q(\theta, x) \ge 0$ for all $x$, $\theta$, and that $h$ is monotonically 
decreasing. The MAP estimate maximizes $h\big(q(\theta, x)\big)$ and, since $h$ is a decreasing 
function, it minimizes $q(\theta, x)$ over $\theta$. By setting to 0 the derivative of $q(\theta, x)$ with 
respect to $\theta$, we obtain 

$$\hat\theta = \bar\theta + \frac{c}{a}(x - \bar x).$$

(We are using here the fact that a nonnegative quadratic function of one variable is 
minimized at a point where its derivative is equal to 0.) 

We will now show that $\hat\theta$ is equal to the LMS and linear LMS estimates [without 
the assumption that $q(\theta, x) \ge 0$ for all $x$, $\theta$, and that $h$ is monotonically decreasing]. 
We write 

$$\theta - \bar\theta = \theta - \hat\theta + \frac{c}{a}(x - \bar x),$$

and substitute in the formula for $q(\theta, x)$ to obtain after some algebra 

$$q(\theta, x) = a(\theta - \hat\theta)^2 + \left(b - \frac{c^2}{a}\right)(x - \bar x)^2.$$

Thus, for any given $x$, the posterior is a function of $\theta$ that is symmetric around $\hat\theta$. This 
implies that $\hat\theta$ is equal to the conditional mean $E[\Theta \mid X = x]$, whenever $E[\Theta \mid X = x]$ is 
well-defined and finite. Furthermore, we have 

$$E[\Theta \mid X] = \bar\theta + \frac{c}{a}(X - \bar x).$$

Since $E[\Theta \mid X]$ is linear in $X$, it is also the linear LMS estimator. 


Problem 21.* Linear LMS estimation based on two observations. Consider 
three random variables $\Theta$, $X$, and $Y$, with known variances and covariances. Assume 
that $\mathrm{var}(X) > 0$, $\mathrm{var}(Y) > 0$, and that $|\rho(X, Y)| \ne 1$. Give a formula for the linear 
LMS estimator of $\Theta$ based on $X$ and $Y$, assuming that $X$ and $Y$ are uncorrelated, and 
also in the general case. 


Solution. We consider a linear estimator of the form $\hat\Theta = aX + bY + c$ and choose $a$, 
$b$, and $c$ to minimize the mean squared error $E[(\Theta - aX - bY - c)^2]$. Suppose that $a$ 
and $b$ have already been chosen. Then, $c$ must minimize $E[(\Theta - aX - bY - c)^2]$, so 

$$c = E[\Theta] - aE[X] - bE[Y].$$

It follows that $a$ and $b$ minimize 

$$E\Big[\big((\Theta - E[\Theta]) - a(X - E[X]) - b(Y - E[Y])\big)^2\Big].$$

We may thus assume that $\Theta$, $X$, and $Y$ are zero mean, and in the final formula subtract 
the means. Under this assumption, the mean squared error is equal to 

$$E[(\Theta - aX - bY)^2] = E[\Theta^2] + a^2E[X^2] + b^2E[Y^2] - 2aE[\Theta X] - 2bE[\Theta Y] + 2abE[XY].$$

Assume that $X$ and $Y$ are uncorrelated, so that $E[XY] = E[X]\,E[Y] = 0$. We 
differentiate the expression for the mean squared error with respect to $a$ and $b$, and set 
the derivatives to zero to obtain 

$$a = \frac{E[\Theta X]}{E[X^2]} = \frac{\mathrm{cov}(\Theta, X)}{\mathrm{var}(X)}, \qquad b = \frac{E[\Theta Y]}{E[Y^2]} = \frac{\mathrm{cov}(\Theta, Y)}{\mathrm{var}(Y)}.$$

Thus, the linear LMS estimator is 

$$\hat\Theta = E[\Theta] + \frac{\mathrm{cov}(\Theta, X)}{\mathrm{var}(X)}\big(X - E[X]\big) + \frac{\mathrm{cov}(\Theta, Y)}{\mathrm{var}(Y)}\big(Y - E[Y]\big).$$

If $X$ and $Y$ are correlated, we similarly set the derivatives of the mean squared error 
to zero. We obtain and then solve a system of two linear equations in the unknowns 
$a$ and $b$, whose solution is 

$$a = \frac{\mathrm{var}(Y)\,\mathrm{cov}(\Theta, X) - \mathrm{cov}(\Theta, Y)\,\mathrm{cov}(X, Y)}{\mathrm{var}(X)\,\mathrm{var}(Y) - \mathrm{cov}^2(X, Y)},$$

$$b = \frac{\mathrm{var}(X)\,\mathrm{cov}(\Theta, Y) - \mathrm{cov}(\Theta, X)\,\mathrm{cov}(X, Y)}{\mathrm{var}(X)\,\mathrm{var}(Y) - \mathrm{cov}^2(X, Y)}.$$

Note that the assumption $|\rho(X, Y)| \ne 1$ guarantees that the denominator in the preceding 
two equations is nonzero. 
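
As an added illustration (not part of the original solution), the Python sketch below evaluates the general formulas for a and b from given second-order statistics and confirms that they satisfy the two linear equations obtained by setting the derivatives of the mean squared error to zero. The function name and the numerical covariances are hypothetical.

```python
def two_observation_llms_coefficients(var_x, var_y, cov_xy, cov_tx, cov_ty):
    """Coefficients (a, b) of the linear LMS estimator of Theta based on X and Y."""
    det = var_x * var_y - cov_xy ** 2
    if det == 0:
        # Happens exactly when |rho(X, Y)| = 1, which the problem rules out.
        raise ValueError("var(X) var(Y) - cov^2(X, Y) must be nonzero")
    a = (var_y * cov_tx - cov_ty * cov_xy) / det
    b = (var_x * cov_ty - cov_tx * cov_xy) / det
    return a, b


# Hypothetical statistics: var(X), var(Y), cov(X,Y), cov(Theta,X), cov(Theta,Y).
var_x, var_y, cov_xy, cov_tx, cov_ty = 2.0, 3.0, 1.0, 1.5, 0.5
a, b = two_observation_llms_coefficients(var_x, var_y, cov_xy, cov_tx, cov_ty)

# Zero-derivative conditions: a var(X) + b cov(X,Y) = cov(Theta,X), and symmetrically for Y.
print(a, b)
print(a * var_x + b * cov_xy - cov_tx, a * cov_xy + b * var_y - cov_ty)
```

Setting cov_xy to zero in the call reduces the coefficients to cov(Θ, X)/var(X) and cov(Θ, Y)/var(Y), matching the uncorrelated case.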


Problem 22.* Linear LMS estimation based on multiple observations. Let $\Theta$ 
be a random variable with mean $\mu$ and variance $\sigma_0^2$, and let $X_1, \ldots, X_n$ be observations 
of the form 

$$X_i = \Theta + W_i,$$

where the observation errors $W_i$ are random variables with mean 0 and variance $\sigma_i^2$. 
We assume that the random variables $\Theta, W_1, \ldots, W_n$ are independent. Verify that the 
linear LMS estimator of $\Theta$ based on $X_1, \ldots, X_n$ is 

$$\hat\Theta = \frac{\mu/\sigma_0^2 + \sum_{i=1}^n X_i/\sigma_i^2}{\sum_{i=0}^n 1/\sigma_i^2},$$

by minimizing over $a_1, \ldots, a_n, b$ the function 

$$h(a_1, \ldots, a_n, b) = \frac{1}{2} E\big[(\Theta - a_1X_1 - \cdots - a_nX_n - b)^2\big].$$
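Solution. We will show that the minimizing values of $a_1, \ldots, a_n, b$ are 

$$b^* = \frac{\mu/\sigma_0^2}{\sum_{i=0}^n 1/\sigma_i^2}, \qquad a_j^* = \frac{1/\sigma_j^2}{\sum_{i=0}^n 1/\sigma_i^2}, \quad j = 1, \ldots, n.$$

To this end, it is sufficient to show that the partial derivatives of $h$, with respect to 
$a_1, \ldots, a_n, b$, are all equal to 0 when evaluated at $a_1^*, \ldots, a_n^*, b^*$. (Because the quadratic 
function $h$ is nonnegative, it can be shown that any point at which its derivatives are 
zero must be a minimum.) 

By differentiating $h$, we obtain 

$$\frac{\partial h}{\partial a_i}\bigg|_{a^*, b^*} = E\bigg[X_i\Big(\sum_{j=1}^n a_j^* X_j + b^* - \Theta\Big)\bigg], \qquad \frac{\partial h}{\partial b}\bigg|_{a^*, b^*} = E\bigg[\sum_{i=1}^n a_i^* X_i + b^* - \Theta\bigg].$$

From the expressions for $b^*$ and $a_i^*$, we see that 

$$\sum_{i=1}^n a_i^* - 1 = -\frac{b^*}{\mu}.$$

Using this equality and the facts 

$$E[\Theta] = \mu, \qquad E[W_i] = 0,$$

it follows that 

$$\frac{\partial h}{\partial b}\bigg|_{a^*, b^*} = E\bigg[\Big(\sum_{i=1}^n a_i^* - 1\Big)\Theta + \sum_{i=1}^n a_i^* W_i + b^*\bigg] = E\bigg[-\frac{b^*}{\mu}\,\Theta + \sum_{i=1}^n a_i^* W_i + b^*\bigg] = 0.$$

Using, in addition, the equations 

$$E[X_i(\mu - \Theta)] = E[(\Theta - \mu + W_i + \mu)(\mu - \Theta)] = -\sigma_0^2,$$
$$E[X_i W_i] = E[(\Theta + W_i)W_i] = \sigma_i^2, \quad \text{for all } i,$$
$$E[X_j W_i] = E[(\Theta + W_j)W_i] = 0, \quad \text{for all } i \text{ and } j \text{ with } i \ne j,$$

we obtain 

$$\frac{\partial h}{\partial a_i}\bigg|_{a^*, b^*} = E\bigg[X_i\Big(\frac{b^*}{\mu}(\mu - \Theta) + \sum_{j=1}^n a_j^* W_j\Big)\bigg] = -\sigma_0^2\,\frac{b^*}{\mu} + a_i^*\sigma_i^2 = 0,$$

where the last equality holds in view of the definitions of $a_i^*$ and $b^*$. 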
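
The weights in this solution can be computed and checked directly. The Python sketch below is an added illustration rather than part of the original text: it evaluates b* and the a_j* from the prior mean, the prior variance, and the noise variances, and then checks the two identities used in the argument. The function names and the numerical values are hypothetical.

```python
def llms_weights(mu, var0, noise_vars):
    """Return ([a_1*, ..., a_n*], b*) for the model X_i = Theta + W_i."""
    # Sum of 1/sigma_i^2 over i = 0, ..., n (prior variance counts as i = 0).
    total_precision = 1.0 / var0 + sum(1.0 / v for v in noise_vars)
    b_star = (mu / var0) / total_precision
    a_star = [(1.0 / v) / total_precision for v in noise_vars]
    return a_star, b_star


def llms_estimate(xs, mu, var0, noise_vars):
    """Linear LMS estimate of Theta from the observed values xs."""
    a_star, b_star = llms_weights(mu, var0, noise_vars)
    return b_star + sum(a * x for a, x in zip(a_star, xs))


mu, var0, noise_vars = 2.0, 1.0, [1.0, 0.5]  # assumed illustrative values
a_star, b_star = llms_weights(mu, var0, noise_vars)

# Identity sum_i a_i* - 1 = -b*/mu, and the zero-derivative condition for each a_i.
print(sum(a_star) - 1.0 + b_star / mu)
print([-var0 * b_star / mu + a * v for a, v in zip(a_star, noise_vars)])
print(llms_estimate([2.5, 1.5], mu, var0, noise_vars))
```

The estimate is a precision-weighted average of the prior mean and the observations, so more accurate observations (smaller noise variance) receive larger weights.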


Problem 23.* Properties of LMS estimation. Let $\Theta$ and $X$ be two random 
variables with positive variances. Let $\hat\Theta_L$ be the linear LMS estimator of $\Theta$ based on 
$X$, and let $\tilde\Theta_L = \hat\Theta_L - \Theta$ be the associated error. Similarly, let $\hat\Theta$ be the LMS estimator 
$E[\Theta \mid X]$ of $\Theta$ based on $X$, and let $\tilde\Theta = \hat\Theta - \Theta$ be the associated error. 


(a) Show that the estimation error $\tilde\Theta_L$ satisfies 

$$E[\tilde\Theta_L] = 0.$$

(b) Show that the estimation error $\tilde\Theta_L$ is uncorrelated with the observation $X$. 

(c) Show that the variance of $\Theta$ can be decomposed as 

$$\mathrm{var}(\Theta) = \mathrm{var}(\hat\Theta_L) + \mathrm{var}(\tilde\Theta_L).$$

(d) Show that the LMS estimation error $\tilde\Theta$ is uncorrelated with any function $h(X)$ 
of the observation $X$. 

(e) Show that $\tilde\Theta$ is not necessarily independent from $X$. 

(f) Show that the linear LMS estimation error $\tilde\Theta_L$ is not necessarily uncorrelated 
with every function $h(X)$ of the observation $X$, and that $E[\tilde\Theta_L \mid X = x]$ need not 
be equal to zero for all $x$. 


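Solution. (a) We have 

$$\hat\Theta_L = E[\Theta] + \frac{\mathrm{cov}(\Theta, X)}{\sigma_X^2}\big(X - E[X]\big).$$

Taking expectations of both sides, we obtain $E[\hat\Theta_L] = E[\Theta]$, or $E[\tilde\Theta_L] = 0$. 

(b) Using the formula for $\hat\Theta_L$, we obtain 

$$E[(\hat\Theta_L - \Theta)X] = E\bigg[\Big(E[\Theta] + \frac{\mathrm{cov}(\Theta, X)}{\sigma_X^2}\big(X - E[X]\big)\Big)X - \Theta X\bigg]$$
$$= E\bigg[E[\Theta]\,X + \frac{\mathrm{cov}(\Theta, X)}{\sigma_X^2}\big(X^2 - X E[X]\big) - \Theta X\bigg]$$
$$= \frac{\mathrm{cov}(\Theta, X)\,E[X^2]}{\sigma_X^2} - \frac{\mathrm{cov}(\Theta, X)\,(E[X])^2}{\sigma_X^2} - \big(E[\Theta X] - E[\Theta]\,E[X]\big)$$
$$= \mathrm{cov}(\Theta, X)\left(\frac{E[X^2] - (E[X])^2}{\sigma_X^2} - 1\right)$$
$$= \mathrm{cov}(\Theta, X)\left(\frac{\sigma_X^2}{\sigma_X^2} - 1\right) = 0.$$

The fact $E[\tilde\Theta_L X] = 0$ we just established, together with the fact $E[\tilde\Theta_L] = 0$ from 
part (a), imply that 

$$\mathrm{cov}(\tilde\Theta_L, X) = E[\tilde\Theta_L X] - E[\tilde\Theta_L]\,E[X] = 0.$$

(c) Since $\mathrm{cov}(\tilde\Theta_L, X) = 0$ and $\hat\Theta_L$ is a linear function of $X$, we obtain $\mathrm{cov}(\tilde\Theta_L, \hat\Theta_L) = 0$. 
Thus, 

$$\mathrm{var}(\Theta) = \mathrm{var}(\hat\Theta_L - \tilde\Theta_L) = \mathrm{var}(\hat\Theta_L) + \mathrm{var}(-\tilde\Theta_L) + 2\,\mathrm{cov}(\hat\Theta_L, -\tilde\Theta_L)$$
$$= \mathrm{var}(\hat\Theta_L) + \mathrm{var}(\tilde\Theta_L) - 2\,\mathrm{cov}(\hat\Theta_L, \tilde\Theta_L) = \mathrm{var}(\hat\Theta_L) + \mathrm{var}(\tilde\Theta_L).$$

(d) This is because $E[\tilde\Theta] = 0$ and 

$$E[\tilde\Theta h(X)] = E\big[(E[\Theta \mid X] - \Theta)h(X)\big]$$
$$= E\big[E[\Theta \mid X]\,h(X)\big] - E[\Theta h(X)]$$
$$= E\big[E[\Theta h(X) \mid X]\big] - E[\Theta h(X)]$$
$$= E[\Theta h(X)] - E[\Theta h(X)]$$
$$= 0.$$

(e) Let $\Theta$ and $X$ be discrete random variables with joint PMF 

$$p_{\Theta,X}(\theta, x) = \begin{cases} 1/3, & \text{for } (\theta, x) = (0, 0),\ (1, 1),\ (-1, 1), \\ 0, & \text{otherwise.} \end{cases}$$

In this example, we have $X = |\Theta|$, so that $X$ and $\Theta$ are not independent. Note that 
$E[\Theta \mid X = x] = 0$ for all possible values $x$, so $E[\Theta \mid X] = 0$. Thus, we have $\tilde\Theta = -\Theta$. 
Since $\Theta$ and $X$ are not independent, $\tilde\Theta$ and $X$ are not independent either. 


(f) Let $\Theta$ and $X$ be discrete random variables with joint PMF 

$$p_{\Theta,X}(\theta, x) = \begin{cases} 1/3, & \text{for } (\theta, x) = (0, 0),\ (1, 1),\ (1, -1), \\ 0, & \text{otherwise.} \end{cases}$$

In this example, we have $\Theta = |X|$. Note that $E[X] = 0$ and $E[\Theta X] = 0$, so that $X$ and 
$\Theta$ are uncorrelated. We have $\hat\Theta_L = E[\Theta] = 2/3$, and $\tilde\Theta_L = (2/3) - \Theta = (2/3) - |X|$, 
which is not independent from $X$. Furthermore, we have $E[\tilde\Theta_L \mid X = x] = (2/3) - |x|$, 
which takes the different values 2/3 and $-1/3$, depending on whether $x = 0$ or $|x| = 1$. 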
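
The counterexample in part (f) is small enough to verify by exact enumeration. The Python sketch below is an added illustration, with helper names that are not from the text: it uses exact rational arithmetic to confirm that Θ and X are uncorrelated, that the conditional mean of the linear LMS error depends on x, and that the error is correlated with the nonlinear function h(X) = |X|.

```python
from fractions import Fraction

# Joint PMF of (theta, x) from part (f), where Theta = |X|.
pmf = {(0, 0): Fraction(1, 3), (1, 1): Fraction(1, 3), (1, -1): Fraction(1, 3)}


def expect(f):
    """Expected value of f(theta, x) under the joint PMF."""
    return sum(p * f(t, x) for (t, x), p in pmf.items())


mean_theta = expect(lambda t, x: t)
mean_x = expect(lambda t, x: x)
cov_theta_x = expect(lambda t, x: t * x) - mean_theta * mean_x

# With zero covariance, the linear LMS estimator reduces to the constant E[Theta].
theta_hat_l = mean_theta


def conditional_mean_error(x0):
    """E[theta_hat_l - Theta | X = x0]."""
    p_x0 = sum(p for (t, x), p in pmf.items() if x == x0)
    return sum(p * (theta_hat_l - t) for (t, x), p in pmf.items() if x == x0) / p_x0


# E[error * |X|]; since E[error] = 0, this equals cov(error, |X|).
error_times_abs_x = expect(lambda t, x: (theta_hat_l - t) * abs(x))

print(cov_theta_x, [conditional_mean_error(x0) for x0 in (-1, 0, 1)], error_times_abs_x)
```

A nonzero value for the last quantity shows that the linear LMS error, unlike the LMS error of part (d), can be correlated with a nonlinear function of the observation.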


Problem 24.* Properties of linear LMS estimation based on multiple observations. 
Let $\Theta, X_1, \ldots, X_n$ be random variables with given variances and covariances. 
Let $\hat\Theta_L$ be the linear LMS estimator of $\Theta$ based on $X_1, \ldots, X_n$, and let $\tilde\Theta_L = \hat\Theta_L - \Theta$ 
be the associated error. Show that $E[\tilde\Theta_L] = 0$ and that $\tilde\Theta_L$ is uncorrelated with $X_i$ for 
every $i$. 

Solution. We start by showing that $E[\tilde\Theta_L X_i] = 0$, for all $i$. Consider a new linear 
estimator of the form $\hat\Theta_L + \alpha X_i$, where $\alpha$ is a scalar parameter. Since $\hat\Theta_L$ is a linear LMS 
estimator, its mean squared error $E[(\hat\Theta_L - \Theta)^2]$ is no larger than the mean squared 
error $h(\alpha) = E[(\hat\Theta_L + \alpha X_i - \Theta)^2]$ of the new estimator. Therefore, the function $h(\alpha)$ 
attains its minimum value when $\alpha = 0$, and $(dh/d\alpha)(0) = 0$. Note that 

$$h(\alpha) = E[(\tilde\Theta_L + \alpha X_i)^2] = E[\tilde\Theta_L^2] + 2\alpha E[\tilde\Theta_L X_i] + \alpha^2 E[X_i^2].$$

The condition $(dh/d\alpha)(0) = 0$ yields $E[\tilde\Theta_L X_i] = 0$. 

Let us now repeat the above argument, but with the constant 1 replacing the 
random variable $X_i$. Following the same steps, we obtain $E[\tilde\Theta_L] = 0$. Finally, note 
that 

$$\mathrm{cov}(\tilde\Theta_L, X_i) = E[\tilde\Theta_L X_i] - E[\tilde\Theta_L]\,E[X_i] = 0 - 0 \cdot E[X_i] = 0,$$

so that $\tilde\Theta_L$ and $X_i$ are uncorrelated. 
In the preceding chapter, we developed the Bayesian approach to inference, where 
unknown parameters are modeled as random variables. In all cases we worked 
within a single, fully specified probabilistic model, and we based most of our 
derivations and calculations on judicious application of Bayes' rule. 

By contrast, in the present chapter we adopt a fundamentally different philosophy: 
we view the unknown parameter $\theta$ as a deterministic (not random) but 
unknown quantity. The observation $X$ is random and its distribution $p_X(x; \theta)$ 
[if $X$ is discrete] or $f_X(x; \theta)$ [if $X$ is continuous] depends on the value of $\theta$ (see 
Fig. 9.1). Thus, instead of working within a single probabilistic model, we will 
be dealing simultaneously with multiple candidate models, one model for each 
possible value of $\theta$. In this context, a "good" hypothesis testing or estimation 
procedure will be one that possesses certain desirable properties under every 
candidate model, that is, for every possible value of $\theta$. In some cases, this may 
be considered to be a worst case viewpoint: a procedure is not considered to 
fulfill our specifications unless it does so against the worst possible value that $\theta$ 
can take. 


[Figure 9.1: block diagram. An observation process with distribution $p_X(x; \theta)$ produces the observation $X$, which is used to form point estimates, select a hypothesis, construct confidence intervals, etc.] 

Figure 9.1. Summary of a classical inference model. For each value of $\theta$, we have 
a distribution $p_X(x; \theta)$. The value $x$ of the observation $X$ is used to compute a 
point estimate, or select a hypothesis, etc. 


Our notation will generally indicate the dependence of probabilities and 
expected values on $\theta$. For example, we will denote by $E_\theta[h(X)]$ the expected 
value of a random variable $h(X)$ as a function of $\theta$. Similarly, we will use the 
notation $P_\theta(A)$ to denote the probability of an event $A$. Note that this only 
indicates a functional dependence, not conditioning in the probabilistic sense. 

The first two sections focus on parameter estimation, with a special emphasis 
on the maximum likelihood and the linear regression methods, often involving 
independent, identically distributed (i.i.d.) observations. The issues here are 
similar to the ones in Bayesian estimation, discussed in the preceding chapter. 
We are interested in estimators (functions of the observations) that have some 
desirable properties. However, the criteria for desirability are somewhat different 
because they should be satisfied for all possible values of the unknown parameter. 
For example, we may require that the expected value of the estimation 
error be zero, or that the estimation error be small with high probability, for all 
possible values of the unknown parameter. 
The third section deals with binary hypothesis testing problems. Here, 
we develop methods that bear similarity with the (Bayesian) MAP method, 
discussed in the preceding chapter. In particular, we calculate the "likelihood" 
of each hypothesis under the observed data, and we choose a hypothesis by 
comparing the ratio of the two likelihoods with a suitably chosen threshold. 

The last section addresses different types of hypothesis testing problems. 
For an example, suppose a coin is tossed $n$ times, the resulting sequence of 
heads and tails is observed, and we wish to decide whether the coin is fair or 
not. The main hypothesis that we wish to test is whether $p = 1/2$, where $p$ 
denotes the unknown probability of heads. The alternative hypothesis ($p \ne 1/2$) 
is composite, in the sense that it consists of several, possibly infinitely many, 
subhypotheses (e.g., $p = 0.1$, $p = 0.4999$, etc.). It is clear that no method can 
reliably distinguish between a coin with $p = 0.5$ and a coin with $p = 0.4999$ 
on the basis of a moderate number of observations. Such problems are usually 
approached using the methodology of significance testing. Here, one asks the 
question: "are the observed data compatible with the hypothesis that $p = 0.5$?" 
Roughly speaking, a hypothesis is rejected if the observed data are unlikely to 
have been generated "accidentally" or "by chance," under that hypothesis. 
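
To make the notion of data being "unlikely under the hypothesis" concrete, the following Python sketch (an added illustration, not the formal procedure developed in Section 9.4) computes, for a fair coin, the exact probability of observing a number of heads at least as far from n/2 as the observed count. Using this two-sided tail probability as the measure of compatibility is an assumption made for the example, and the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def two_sided_tail_fair_coin(n, k):
    """P(|K - n/2| >= |k - n/2|) when K is the number of heads in n fair tosses."""
    deviation = abs(2 * k - n)  # compare 2j - n with 2k - n to stay in integers
    favorable = sum(comb(n, j) for j in range(n + 1) if abs(2 * j - n) >= deviation)
    return Fraction(favorable, 2 ** n)


# Tail probabilities for two hypothetical outcomes of 100 tosses.
print(float(two_sided_tail_fair_coin(100, 60)))
print(float(two_sided_tail_fair_coin(100, 52)))
```

A small tail probability indicates an outcome that a fair coin would rarely produce, while a value close to 1 indicates an outcome that is entirely typical under $p = 0.5$.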





Major Terms, Problems, and Methods in this Chapter 


• Classical statistics treats unknown parameters as constants to be 
determined. A separate probabilistic model is assumed for each possible 
value of the unknown parameter. 


• In parameter estimation, we want to generate estimates that are 
nearly correct under any possible value of the unknown parameter. 


• In hypothesis testing, the unknown parameter takes a finite number 
$m$ of values ($m \ge 2$), corresponding to competing hypotheses; we want 
to choose one of the hypotheses, aiming to achieve a small probability 
of error under any of the possible hypotheses. 


• In significance testing, we want to accept or reject a single hypothesis, 
while keeping the probability of false rejection suitably small. 


• Principal classical inference methods in this chapter: 


(a) Maximum likelihood (ML) estimation: Select the parameter 
that makes the observed data "most likely," i.e., maximizes 
the probability of obtaining the data at hand (Section 9.1). 


(b) Linear regression: Find the linear relation that matches best 
a set of data pairs, in the sense that it minimizes the sum of 
the squares of the discrepancies between the model and the data 
(Section 9.2). 


(c) Likelihood ratio test: Given two hypotheses, select one based 
on the ratio of their "likelihoods," so that certain error probabilities 
are suitably small (Section 9.3). 


(d) Significance testing: Given a hypothesis, reject it if and only if 
the observed data falls within a certain rejection region. This region 
is specially designed to keep the probability of false rejection 
below some threshold (Section 9.4). 





9.1 CLASSICAL PARAMETER ESTIMATION 


In this section, we focus on parameter estimation, using the classical approach 
where the parameter $\theta$ is not random, but is rather viewed as an unknown constant. 
We first introduce some definitions and associated properties of estimators. 
We then discuss the maximum likelihood estimator, which may be viewed 
as the classical counterpart of the Bayesian MAP estimator. We finally focus on 
the simple but important example of estimating an unknown mean, and possibly 
an unknown variance. We also discuss the associated issue of constructing an 
interval that contains the unknown parameter with high probability (a "confidence 
interval"). The methods that we develop rely heavily on the laws of large 
numbers and the central limit theorem (cf. Chapter 5). 


Properties of Estimators 


Given observations $X = (X_1, \ldots, X_n)$, an estimator is a random variable of 
the form $\hat\Theta = g(X)$, for some function $g$. Note that since the distribution of 
$X$ depends on $\theta$, the same is true for the distribution of $\hat\Theta$. We use the term 
estimate to refer to an actual realized value of $\hat\Theta$. 

Sometimes, particularly when we are interested in the role of the number 
of observations $n$, we use the notation $\hat\Theta_n$ for an estimator. It is then also 
appropriate to view $\hat\Theta_n$ as a sequence of estimators (one for each value of $n$). 
The mean and variance of $\hat\Theta_n$ are denoted $E_\theta[\hat\Theta_n]$ and $\mathrm{var}_\theta(\hat\Theta_n)$, respectively, 
and are defined in the usual way. Both $E_\theta[\hat\Theta_n]$ and $\mathrm{var}_\theta(\hat\Theta_n)$ are numerical 
functions of $\theta$, but for simplicity, when the context is clear we sometimes do not 
show this dependence. 

We introduce some terminology related to various properties of estimators. 





Terminology Regarding Estimators 


Let $\hat\Theta_n$ be an estimator of an unknown parameter $\theta$, that is, a function of 
$n$ observations $X_1, \ldots, X_n$ whose distribution depends on $\theta$. 

• The estimation error, denoted by $\tilde\Theta_n$, is defined by $\tilde\Theta_n = \hat\Theta_n - \theta$. 


• The bias of the estimator, denoted by $b_\theta(\hat\Theta_n)$, is the expected value 
of the estimation error: 

$$b_\theta(\hat\Theta_n) = E_\theta[\hat\Theta_n] - \theta.$$


• The expected value, the variance, and the bias of $\hat\Theta_n$ depend on $\theta$, 
while the estimation error depends in addition on the observations 
$X_1, \ldots, X_n$. 


• We call $\hat\Theta_n$ unbiased if $E_\theta[\hat\Theta_n] = \theta$, for every possible value of $\theta$. 


• We call $\hat\Theta_n$ asymptotically unbiased if $\lim_{n \to \infty} E_\theta[\hat\Theta_n] = \theta$, for 
every possible value of $\theta$. 


• We call $\hat\Theta_n$ consistent if the sequence $\hat\Theta_n$ converges to the true value 
of the parameter $\theta$, in probability, for every possible value of $\theta$. 





An estimator, being a function of the random observations, cannot be expected 
to be exactly equal to the unknown value $\theta$. Thus, the estimation error 
will be generically nonzero. On the other hand, if the average estimation error 
is zero, for every possible value of $\theta$, then we have an unbiased estimator, and 
this is a desirable property. Asymptotic unbiasedness only requires that the estimator 
become unbiased as the number $n$ of observations increases, and this is 
desirable when $n$ is large. 

Besides the bias $b_\theta(\hat\Theta_n)$, we are usually interested in the size of the estimation 
error. This is captured by the mean squared error $E_\theta[\tilde\Theta_n^2]$, which is related 
to the bias and the variance of $\hat\Theta_n$ according to the following formula:† 

$$E_\theta[\tilde\Theta_n^2] = b_\theta^2(\hat\Theta_n) + \mathrm{var}_\theta(\hat\Theta_n).$$

This formula is important because in many statistical problems, there is a tradeoff 
between the two terms on the right-hand side. Often a reduction in the 
variance is accompanied by an increase in the bias. Of course, a good estimator 
is one that manages to keep both terms small. 
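
The decomposition of the mean squared error into squared bias plus variance, and the tradeoff between the two terms, can be seen in a small exact computation. The Python sketch below is an added illustration under stated assumptions: the observations are taken to be n = 3 independent Bernoulli tosses with parameter θ = 1/4, and `shrunk_mean` is a hypothetical biased estimator introduced only for comparison with the sample mean.

```python
from fractions import Fraction
from itertools import product


def bias_variance_mse(estimator, theta, n):
    """Exact bias, variance, and MSE of estimator(xs) for n i.i.d. Bernoulli(theta) tosses."""
    mean = second_moment = mse = Fraction(0)
    for xs in product((0, 1), repeat=n):
        k = sum(xs)
        prob = theta ** k * (1 - theta) ** (n - k)  # probability of this sequence
        value = estimator(xs)
        mean += prob * value
        second_moment += prob * value ** 2
        mse += prob * (value - theta) ** 2
    return mean - theta, second_moment - mean ** 2, mse


def sample_mean(xs):
    return Fraction(sum(xs), len(xs))


def shrunk_mean(xs):
    # Hypothetical biased estimator that pulls the estimate toward 1/2.
    return Fraction(sum(xs) + 1, len(xs) + 2)


theta = Fraction(1, 4)
for estimator in (sample_mean, shrunk_mean):
    bias, variance, mse = bias_variance_mse(estimator, theta, n=3)
    print(estimator.__name__, bias, variance, mse, mse == bias ** 2 + variance)
```

For this particular value of θ, the biased estimator has the smaller mean squared error, because its reduction in variance outweighs its squared bias; for other values of θ the comparison can go the other way.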

We will now discuss some specific estimation approaches, starting with 
maximum likelihood estimation. This is a general method that bears similarity 
to MAP estimation, introduced in the context of Bayesian inference. We will 
subsequently consider the simple but important case of estimating the mean 
and variance of a random variable. This will bring about a connection with our 
discussion of the laws of large numbers in Chapter 5. 


† This is an application of the formula $E[X^2] = (E[X])^2 + \mathrm{var}(X)$, with $X = \tilde\Theta_n$ 
and where the expectation is taken with respect to the distribution corresponding to $\theta$; 
we are also using the facts $E_\theta[\tilde\Theta_n] = b_\theta(\hat\Theta_n)$ and $\mathrm{var}_\theta(\tilde\Theta_n) = \mathrm{var}_\theta(\hat\Theta_n - \theta) = \mathrm{var}_\theta(\hat\Theta_n)$. 
Maximum Likelihood Estimation 

Let the vector of observations $X = (X_1, \ldots, X_n)$ be described by a joint PMF 
$p_X(x; \theta)$ whose form depends on an unknown (scalar or vector) parameter $\theta$. 
Suppose we observe a particular value $x = (x_1, \ldots, x_n)$ of $X$. Then, a maximum 
likelihood (ML) estimate is a value of the parameter that maximizes the 
numerical function $p_X(x_1, \ldots, x_n; \theta)$ over all $\theta$ (see Fig. 9.2): 

$$\hat\theta_n = \arg\max_\theta\, p_X(x_1, \ldots, x_n; \theta).$$

For the case where $X$ is continuous, the same approach applies with $p_X(x; \theta)$ 
replaced by the joint PDF $f_X(x; \theta)$, so that 

$$\hat\theta_n = \arg\max_\theta\, f_X(x_1, \ldots, x_n; \theta).$$

We refer to $p_X(x; \theta)$ [or $f_X(x; \theta)$ if $X$ is continuous] as the likelihood function. 








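[Figure 9.2: block diagram. The observation process produces $X = x$; the likelihoods $p_X(x; \theta_1), \ldots, p_X(x; \theta_m)$ are computed and passed to a "Max" block, which outputs the ML estimate $\hat\theta$.] 

Figure 9.2. Illustration of ML estimation, assuming $X$ is discrete and $\theta$ takes 
one of the $m$ values $\theta_1, \ldots, \theta_m$. Given the value of the observation $X = x$, the 
values of the likelihood function $p_X(x; \theta_i)$ become available for all $i$, and a value 
of $\theta$ that maximizes $p_X(x; \theta)$ is selected. 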


In many applications, the observations $X_i$ are assumed to be independent, 
in which case, the likelihood function is of the form 

$$p_X(x_1, \ldots, x_n; \theta) = \prod_{i=1}^n p_{X_i}(x_i; \theta)$$

(for discrete $X_i$). In this case, it is often analytically or computationally convenient 
to maximize its logarithm, called the log-likelihood function, 

$$\log p_X(x_1, \ldots, x_n; \theta) = \log \prod_{i=1}^n p_{X_i}(x_i; \theta) = \sum_{i=1}^n \log p_{X_i}(x_i; \theta),$$

over $\theta$. When $X$ is continuous, there is a similar possibility, with PMFs replaced 
by PDFs: we maximize over $\theta$ the expression 

$$\log f_X(x_1, \ldots, x_n; \theta) = \log \prod_{i=1}^n f_{X_i}(x_i; \theta) = \sum_{i=1}^n \log f_{X_i}(x_i; \theta).$$


The term "likelihood" needs to be interpreted properly. In particular, having
observed the value $x$ of $X$, $p_X(x; \theta)$ is not the probability that the unknown
parameter is equal to $\theta$. Instead, it is the probability that the observed value $x$
can arise when the parameter is equal to $\theta$. Thus, in maximizing the likelihood,
we are asking the question: "What is the value of $\theta$ under which the observations
we have seen are most likely to arise?"

Recall that in Bayesian MAP estimation, the estimate is chosen to maximize
the expression $p_\Theta(\theta)\, p_{X|\Theta}(x \mid \theta)$ over all $\theta$, where $p_\Theta(\theta)$ is the prior PMF
of an unknown discrete parameter $\theta$. Thus, if we view $p_X(x; \theta)$ as a conditional
PMF, we may interpret ML estimation as MAP estimation with a flat prior,
i.e., a prior which is the same for all $\theta$, indicating the absence of any useful
prior knowledge. Similarly, in the case of continuous $\theta$ with a bounded range
of possible values, we may interpret ML estimation as MAP estimation with a
uniform prior: $f_\Theta(\theta) = c$ for all $\theta$ and some constant $c$.


Example 9.1. Let us revisit Example 8.2, in which Juliet is always late by an
amount $X$ that is uniformly distributed over the interval $[0, \theta]$, and $\theta$ is an unknown
parameter. In that example, we used a random variable $\Theta$ with flat prior PDF $f_\Theta(\theta)$
(uniform over the interval $[0, 1]$) to model the parameter, and we showed that the
MAP estimate is the value $x$ of $X$. In the classical context of this section, there is
no prior, and $\theta$ is treated as a constant, but the ML estimate is also $\hat{\theta} = x$.
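A brief justification of this ML estimate, sketched for a single observation $x > 0$ under the uniform model stated in the example: the likelihood is zero for parameter values smaller than $x$ and decreases in $\theta$ otherwise, so it is largest at the smallest admissible value.

$$f_X(x; \theta) = \begin{cases} 1/\theta, & \text{if } \theta \ge x, \\ 0, & \text{if } \theta < x, \end{cases} \qquad \text{so that} \qquad \arg\max_{\theta} f_X(x; \theta) = x.$$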


Example 9.2. Estimating the Mean of a Bernoulli Random Variable. We
want to estimate the probability of heads, $\theta$, of a biased coin, based on the outcomes
of $n$ independent tosses $X_1, \ldots, X_n$ ($X_i = 1$ for a head, and $X_i = 0$ for a tail). This
is similar to the Bayesian setting of Example 8.8, where we assumed a flat prior.
We found there that the peak of the posterior PDF (the MAP estimate) is located
at $\theta = k/n$, where $k$ is the number of heads observed. It follows that $k/n$ is also
the ML estimate of $\theta$, so that the ML estimator is

$$\hat{\Theta}_n = \frac{X_1 + \cdots + X_n}{n}.$$

This estimator is unbiased. It is also consistent, because $\hat{\Theta}_n$ converges to $\theta$ in
probability, by the weak law of large numbers.

It is interesting to compare the ML estimator with the LMS estimator, obtained
with a Bayesian approach in Example 8.8. We showed there that with a
flat prior, the posterior mean is $(k + 1)/(n + 2)$. Thus, the ML estimate, $k/n$, is
similar but not identical to the LMS estimate obtained with the Bayesian approach.
However, as $n \to \infty$, the two estimates asymptotically coincide.
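The comparison between the two estimates can be illustrated numerically. The following is a minimal sketch; the function names and the toss data are hypothetical and chosen only for illustration. Exact fractions are used so that $k/n$ and $(k+1)/(n+2)$ can be compared without rounding.

```python
from fractions import Fraction


def bernoulli_ml_estimate(tosses):
    """ML estimate k/n of the probability of heads from 0/1 outcomes."""
    n = len(tosses)
    k = sum(tosses)
    return Fraction(k, n)


def bernoulli_lms_estimate(tosses):
    """Posterior mean (k + 1)/(n + 2) under a flat prior, as in Example 8.8."""
    n = len(tosses)
    k = sum(tosses)
    return Fraction(k + 1, n + 2)


# Hypothetical data: 5 heads in 8 tosses.
tosses = [1, 0, 1, 1, 0, 1, 0, 1]
print(bernoulli_ml_estimate(tosses))   # 5/8
print(bernoulli_lms_estimate(tosses))  # 3/5

# With the same fraction of heads and many more tosses, the gap shrinks.
many_tosses = tosses * 1000
gap = abs(bernoulli_ml_estimate(many_tosses) - bernoulli_lms_estimate(many_tosses))
print(float(gap))
```

The gap between the two estimates is $|n - 2k| / (n(n+2))$, which is at most $1/(n+2)$, consistent with the asymptotic agreement noted above.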
Example 9.3. Estimating the Parameter of an Exponential Random Variable.
Customers arrive to a facility, with the $i$th customer arriving at time $Y_i$. We
assume that the $i$th interarrival time, $X_i = Y_i - Y_{i-1}$ (with the convention $Y_0 = 0$)
is exponentially distributed with unknown parameter $\theta$, and that the random
variables $X_1, \ldots, X_n$ are independent. (This is the Poisson arrivals model, studied in
Chapter 6.) We wish to estimate the value of $\theta$ (interpreted as the arrival rate), on
the basis of the observations $X_1, \ldots, X_n$.
The corresponding likelihood function is

$$f_X(x; \theta) = \prod_{i=1}^{n} f_{X_i}(x_i; \theta) = \prod_{i=1}^{n} \theta e^{-\theta x_i},$$

and the log-likelihood function is

$$\log f_X(x; \theta) = n \log \theta - \theta y_n,$$

where

$$y_n = \sum_{i=1}^{n} x_i.$$

The derivative with respect to $\theta$ is $(n/\theta) - y_n$, and by setting it to 0, we see that
the maximum of $\log f_X(x; \theta)$, over $\theta > 0$, is attained at $\hat{\theta}_n = n/y_n$. The resulting
estimator is

$$\hat{\Theta}_n = \left( \frac{Y_n}{n} \right)^{-1}.$$

It is the inverse of the sample mean of the interarrival times, and it can be
interpreted as an empirical arrival rate.

Note that by the weak law of large numbers, $Y_n/n$ converges in probability
to $E[X_i] = 1/\theta$, as $n \to \infty$. This can be used to show that $\hat{\Theta}_n$ converges to $\theta$ in
probability, so the estimator is consistent.
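The estimate $n/y_n$ is easy to check by simulation. The sketch below is illustrative only: the function name, the true rate, the sample size, and the random seed are all assumptions made for the example, and the interarrival times are generated with the standard library's exponential sampler.

```python
import random


def exponential_ml_rate(interarrival_times):
    """ML estimate n / y_n of the rate, where y_n is the sum of the observations."""
    n = len(interarrival_times)
    y_n = sum(interarrival_times)
    return n / y_n


# Hypothetical experiment: 20,000 interarrival times with true rate 2.0.
rng = random.Random(0)
true_rate = 2.0
samples = [rng.expovariate(true_rate) for _ in range(20000)]
estimate = exponential_ml_rate(samples)
print(estimate)  # close to 2.0 for a sample of this size
```

For a small hand-checkable case, the observations 0.5, 1.5, 1.0 sum to 3.0, so the estimated rate is 3/3.0 = 1.0.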


Our discussion and examples so far have focused on the case of a single
unknown parameter $\theta$. The next example involves a two-dimensional parameter.


Example 9.4. Estimating the Mean and Variance of a Normal. Consider
the problem of estimating the mean $\mu$ and variance $v$ of a normal distribution using
$n$ independent observations $X_1, \ldots, X_n$. The parameter vector here is $\theta = (\mu, v)$.
The corresponding likelihood function is

$$f_X(x; \mu, v) = \prod_{i=1}^{n} f_{X_i}(x_i; \mu, v) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi v}}\, e^{-(x_i - \mu)^2 / 2v}.$$

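After some calculation it can be written as†

$$f_X(x; \mu, v) = \frac{1}{(2\pi v)^{n/2}} \cdot \exp\left\{ -\frac{n \bar{s}_n^2}{2v} \right\} \cdot \exp\left\{ -\frac{n (m_n - \mu)^2}{2v} \right\},$$

where $m_n$ is the realized value of the random variable

$$M_n = \frac{1}{n} \sum_{i=1}^{n} X_i,$$

and $\bar{s}_n^2$ is the realized value of the random variable

$$\bar{S}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - M_n)^2.$$

The log-likelihood function is

$$\log f_X(x; \mu, v) = -\frac{n}{2} \log(2\pi) - \frac{n}{2} \log v - \frac{n \bar{s}_n^2}{2v} - \frac{n (m_n - \mu)^2}{2v}.$$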

Setting to zero the derivatives of this function with respect to $\mu$ and $v$, we obtain
the estimate and estimator, respectively,

$$\hat{\theta}_n = (m_n, \bar{s}_n^2), \qquad \hat{\Theta}_n = (M_n, \bar{S}_n^2).$$

Note that $M_n$ is the sample mean, while $\bar{S}_n^2$ may be viewed as a "sample variance."
As will be shown shortly, $E_\theta[\bar{S}_n^2]$ converges to $v$ as $n$ increases, so that $\bar{S}_n^2$ is
asymptotically unbiased. Using also the weak law of large numbers, it can be shown that
$M_n$ and $\bar{S}_n^2$ are consistent estimators of $\mu$ and $v$, respectively.
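As an informal check of Example 9.4, the sketch below computes the pair (sample mean, sample variance with divisor $n$) for a small data set and evaluates the normal log-likelihood at that point and at nearby points. The function names and the data are hypothetical, used only for illustration.

```python
import math


def normal_ml_estimates(xs):
    """Return (m_n, s_bar_sq): the sample mean and the variance with divisor n."""
    n = len(xs)
    m_n = sum(xs) / n
    s_bar_sq = sum((x - m_n) ** 2 for x in xs) / n
    return m_n, s_bar_sq


def normal_log_likelihood(xs, mu, v):
    """Log-likelihood of i.i.d. normal observations with mean mu and variance v."""
    n = len(xs)
    return -0.5 * n * math.log(2 * math.pi * v) - sum((x - mu) ** 2 for x in xs) / (2 * v)


# Hypothetical observations.
xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
m_n, s_bar_sq = normal_ml_estimates(xs)
print(m_n, s_bar_sq)  # 5.0 4.0

# The log-likelihood at the ML estimate is at least as large as at nearby points.
best = normal_log_likelihood(xs, m_n, s_bar_sq)
print(best >= normal_log_likelihood(xs, m_n + 0.1, s_bar_sq))  # True
print(best >= normal_log_likelihood(xs, m_n, s_bar_sq - 0.1))  # True
```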


Maximum likelihood estimation has some appealing properties. For example,
it obeys the invariance principle: if $\hat{\Theta}_n$ is the ML estimate of $\theta$, then for
any one-to-one function $h$ of $\theta$, the ML estimate of the parameter $\zeta = h(\theta)$ is
$h(\hat{\Theta}_n)$. Also, when the observations are i.i.d., and under some mild additional
assumptions, it can be shown that the ML estimator is consistent.

Another interesting property is that when $\theta$ is a scalar parameter, then
under some mild conditions, the ML estimator has an asymptotic normality
property. In particular, it can be shown that the distribution of $(\hat{\Theta}_n - \theta)/\sigma(\hat{\Theta}_n)$,
where $\sigma^2(\hat{\Theta}_n)$ is the variance of $\hat{\Theta}_n$, approaches a standard normal distribution.
Thus, if we are able to also estimate $\sigma(\hat{\Theta}_n)$, we can use it to derive an error
variance estimate based on a normal approximation. When $\theta$ is a vector parameter,
a similar statement applies to each one of its components.


† To verify this, write for $i = 1, \ldots, n$,

$$(x_i - \mu)^2 = (x_i - m_n + m_n - \mu)^2 = (x_i - m_n)^2 + (m_n - \mu)^2 + 2(x_i - m_n)(m_n - \mu),$$

sum over $i$, and note that

$$\sum_{i=1}^{n} (x_i - m_n)(m_n - \mu) = (m_n - \mu) \sum_{i=1}^{n} (x_i - m_n) = 0.$$

Maximum Likelihood Estimation

• We are given the realization $x = (x_1, \ldots, x_n)$ of a random vector
  $X = (X_1, \ldots, X_n)$, distributed according to a PMF $p_X(x; \theta)$ or PDF
  $f_X(x; \theta)$.

• The maximum likelihood (ML) estimate is a value of $\theta$ that maximizes
  the likelihood function, $p_X(x; \theta)$ or $f_X(x; \theta)$, over all $\theta$.

• The ML estimate of a one-to-one function $h(\theta)$ of $\theta$ is $h(\hat{\theta}_n)$, where $\hat{\theta}_n$
  is the ML estimate of $\theta$ (the invariance principle).

• When the random variables $X_i$ are i.i.d., and under some mild additional
  assumptions, each component of the ML estimator is consistent
  and asymptotically normal.





Estimation of the Mean and Variance of a Random Variable 


We now discuss the simple but important problem of estimating the mean and
the variance of a probability distribution. This is similar to the preceding
Example 9.4, but in contrast with that example, we do not assume that the distribution
is normal. In fact, the estimators presented here do not require knowledge of the
distributions $p_X(x; \theta)$ [or $f_X(x; \theta)$ if $X$ is continuous].

Suppose that the observations $X_1, \ldots, X_n$ are i.i.d., with an unknown common
mean $\theta$. The most natural estimator of $\theta$ is the sample mean:

$$M_n = \frac{X_1 + \cdots + X_n}{n}.$$

This estimator is unbiased, since $E_\theta[M_n] = E_\theta[X] = \theta$. Its mean squared error
is equal to its variance, which is $v/n$, where $v$ is the common variance of the $X_i$.
Note that the mean squared error does not depend on $\theta$. Furthermore, by the
weak law of large numbers, this estimator converges to $\theta$ in probability, and is
therefore consistent.

The sample mean is not necessarily the estimator with the smallest variance.
For example, consider the estimator $\hat{\Theta}_n = 0$, which ignores the observations
and always yields an estimate of zero. The variance of $\hat{\Theta}_n$ is zero, but its
bias is $b_\theta(\hat{\Theta}_n) = -\theta$. In particular, the mean squared error depends on $\theta$ and is
equal to $\theta^2$.

The next example compares the sample mean with a Bayesian MAP estimator
that we derived in Section 8.2 under certain assumptions.


Example 9.5. Suppose that the observations $X_1, \ldots, X_n$ are normal, i.i.d., with
an unknown common mean $\theta$, and known variance $v$. In Example 8.3, we used
a Bayesian approach, and assumed a normal prior distribution on the parameter
$\theta$. For the case where the prior mean of $\theta$ was zero, we arrived at the following
estimator:

$$\hat{\Theta}_n = \frac{X_1 + \cdots + X_n}{n + 1}.$$

This estimator is biased, because $E_\theta[\hat{\Theta}_n] = n\theta/(n + 1)$ and $b_\theta(\hat{\Theta}_n) = -\theta/(n + 1)$.
However, $\lim_{n \to \infty} b_\theta(\hat{\Theta}_n) = 0$, so $\hat{\Theta}_n$ is asymptotically unbiased. Its variance is

$$\mathrm{var}_\theta(\hat{\Theta}_n) = \frac{vn}{(n + 1)^2},$$

and it is slightly smaller than the variance $v/n$ of the sample mean. Note that in
the special case of this example, $\mathrm{var}_\theta(\hat{\Theta}_n)$ is independent of $\theta$. The mean squared
error is equal to

$$E_\theta\big[(\hat{\Theta}_n - \theta)^2\big] = b_\theta^2(\hat{\Theta}_n) + \mathrm{var}_\theta(\hat{\Theta}_n) = \frac{\theta^2}{(n + 1)^2} + \frac{vn}{(n + 1)^2}.$$
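The two mean squared errors in Example 9.5 can be compared directly from their formulas. The following sketch uses hypothetical function names and parameter values. Rearranging the two expressions shows that the estimator with divisor $n+1$ has the smaller mean squared error exactly when $\theta^2 < v(2 + 1/n)$, i.e., when the true mean is close to the prior mean of zero.

```python
def mse_sample_mean(v, n):
    """Mean squared error v/n of the sample mean (unbiased, so MSE = variance)."""
    return v / n


def mse_shrunk_mean(theta, v, n):
    """MSE of (X_1 + ... + X_n)/(n + 1): squared bias plus variance."""
    bias = -theta / (n + 1)
    variance = v * n / (n + 1) ** 2
    return bias ** 2 + variance


# Hypothetical values: v = 1, n = 10, and two candidate true means.
v, n = 1.0, 10
print(mse_sample_mean(v, n))        # 0.1
print(mse_shrunk_mean(0.0, v, n))   # 10/121, smaller than 0.1
print(mse_shrunk_mean(2.0, v, n))   # 14/121, larger than 0.1
```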


Suppose that in addition to the sample mean/estimator of $\theta$,

$$M_n = \frac{X_1 + \cdots + X_n}{n},$$

we are interested in an estimator of the variance $v$. A natural one is

$$\bar{S}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - M_n)^2,$$

which coincides with the ML estimator derived in Example 9.4 under a normality
assumption.
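Using the facts

$$E_{(\theta,v)}[M_n] = \theta, \qquad E_{(\theta,v)}[X_i^2] = \theta^2 + v, \qquad E_{(\theta,v)}[M_n^2] = \theta^2 + \frac{v}{n},$$

we have

$$\begin{aligned}
E_{(\theta,v)}\big[\bar{S}_n^2\big]
&= \frac{1}{n}\, E_{(\theta,v)}\left[ \sum_{i=1}^{n} X_i^2 - 2 M_n \sum_{i=1}^{n} X_i + n M_n^2 \right] \\
&= E_{(\theta,v)}\left[ \frac{1}{n} \sum_{i=1}^{n} X_i^2 - 2 M_n^2 + M_n^2 \right] \\
&= E_{(\theta,v)}\left[ \frac{1}{n} \sum_{i=1}^{n} X_i^2 - M_n^2 \right] \\
&= \theta^2 + v - \left( \theta^2 + \frac{v}{n} \right) \\
&= \frac{n - 1}{n}\, v.
\end{aligned}$$

Thus, $\bar{S}_n^2$ is not an unbiased estimator of $v$, although it is asymptotically
unbiased.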

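We can obtain an unbiased variance estimator after some suitable scaling.
This is the estimator

$$\hat{S}_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - M_n)^2 = \frac{n}{n - 1}\, \bar{S}_n^2.$$

The preceding calculation shows that

$$E_{(\theta,v)}\big[\hat{S}_n^2\big] = v,$$

so $\hat{S}_n^2$ is an unbiased estimator of $v$, for all $n$. However, for large $n$, the estimators
$\hat{S}_n^2$ and $\bar{S}_n^2$ are essentially the same.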
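The bias calculation can be confirmed exactly on a small case by enumerating every possible sample. The sketch below is illustrative: the function name and the choice of a fair coin with $n = 3$ are assumptions made for the example. It averages both variance estimators over all equally likely outcomes using exact fractions.

```python
from fractions import Fraction
from itertools import product


def expected_variance_estimates(values, n):
    """Exact expectations of the divisor-n and divisor-(n-1) variance estimators
    for n i.i.d. draws, each uniformly distributed over `values`."""
    outcomes = list(product(values, repeat=n))
    total_biased = Fraction(0)
    total_unbiased = Fraction(0)
    for xs in outcomes:
        m = Fraction(sum(xs), n)                          # sample mean M_n
        ss = sum((Fraction(x) - m) ** 2 for x in xs)      # sum of squared deviations
        total_biased += ss / n
        total_unbiased += ss / (n - 1)
    return total_biased / len(outcomes), total_unbiased / len(outcomes)


# Fair coin: mean 1/2 and variance v = 1/4; three tosses.
biased, unbiased = expected_variance_estimates([0, 1], n=3)
print(biased, unbiased)  # 1/6 1/4
```

Here $(n-1)v/n = (2/3)(1/4) = 1/6$, matching the expectation of the divisor-$n$ estimator, while the divisor-$(n-1)$ estimator has expectation exactly $v = 1/4$.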


Estimates of the Mean and Variance of a Random Variable

Let the observations $X_1, \ldots, X_n$ be i.i.d., with mean $\theta$ and variance $v$ that
are unknown.

• The sample mean

  $$M_n = \frac{X_1 + \cdots + X_n}{n}$$

  is an unbiased estimator of $\theta$, and its mean squared error is $v/n$.

• Two variance estimators are

  $$\bar{S}_n^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - M_n)^2, \qquad \hat{S}_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - M_n)^2.$$

• The estimator $\bar{S}_n^2$ coincides with the ML estimator if the $X_i$ are normal.
  It is biased but asymptotically unbiased. The estimator $\hat{S}_n^2$ is
  unbiased. For large $n$, the two variance estimators essentially coincide.





Confidence Intervals 


Consider an estimator $\hat{\Theta}_n$ of an unknown parameter $\theta$. Besides the numerical
value provided by an estimate, we are often interested in constructing a so-called
confidence interval. Roughly speaking, this is an interval that contains $\theta$ with a
certain high probability, for every possible value of $\theta$.

For a precise definition, let us first fix a desired confidence level, $1 - \alpha$,
where $\alpha$ is typically a small number. We then replace the point estimator $\hat{\Theta}_n$ by
a lower estimator $\hat{\Theta}_n^-$ and an upper estimator $\hat{\Theta}_n^+$, designed so that $\hat{\Theta}_n^- \le \hat{\Theta}_n^+$,
and

$$P_\theta\big( \hat{\Theta}_n^- \le \theta \le \hat{\Theta}_n^+ \big) \ge 1 - \alpha,$$

for every possible value of $\theta$. Note that, similar to estimators, $\hat{\Theta}_n^-$ and $\hat{\Theta}_n^+$ are
functions of the observations, and hence random variables whose distributions
depend on $\theta$. We call $[\hat{\Theta}_n^-, \hat{\Theta}_n^+]$ a $1 - \alpha$ confidence interval.


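Example 9.6. Suppose that the observations $X_i$ are i.i.d. normal, with unknown
mean $\theta$ and known variance $v$. Then, the sample mean estimator

$$\hat{\Theta}_n = \frac{X_1 + \cdots + X_n}{n}$$

is normal,† with mean $\theta$ and variance $v/n$. Let $\alpha = 0.05$. Using the CDF $\Phi(z)$ of the
standard normal (available in the normal tables), we have $\Phi(1.96) = 0.975 = 1 - \alpha/2$
and we obtain

$$P_\theta\left( \frac{|\hat{\Theta}_n - \theta|}{\sqrt{v/n}} \le 1.96 \right) = 0.95.$$

We can rewrite this statement in the form

$$P_\theta\left( \hat{\Theta}_n - 1.96 \sqrt{\frac{v}{n}} \le \theta \le \hat{\Theta}_n + 1.96 \sqrt{\frac{v}{n}} \right) = 0.95,$$

which implies that

$$\left[ \hat{\Theta}_n - 1.96 \sqrt{\frac{v}{n}},\ \hat{\Theta}_n + 1.96 \sqrt{\frac{v}{n}} \right]$$

is a 95% confidence interval, where we identify $\hat{\Theta}_n^-$ and $\hat{\Theta}_n^+$ with $\hat{\Theta}_n - 1.96\sqrt{v/n}$
and $\hat{\Theta}_n + 1.96\sqrt{v/n}$, respectively.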


In the preceding example, we may be tempted to describe the concept of
a 95% confidence interval by a statement such as "the true parameter lies in
the confidence interval with probability 0.95." Such statements, however, can
be ambiguous. For example, suppose that after the observations are obtained,
the confidence interval turns out to be $[-2.3, 4.1]$. We cannot say that $\theta$ lies in
$[-2.3, 4.1]$ with probability 0.95, because the latter statement does not involve
any random variables; after all, in the classical approach, $\theta$ is a constant.
Instead, the random entity in the phrase "the true parameter lies in the confidence
interval" is the confidence interval, not the true parameter.

For a concrete interpretation, suppose that $\theta$ is fixed. We construct a
confidence interval many times, using the same statistical procedure, i.e., each
time, we obtain an independent collection of $n$ observations and construct the
corresponding 95% confidence interval. We then expect that about 95% of these
confidence intervals will include $\theta$. This should be true regardless of what the
value of $\theta$ is.

† We are using here the important fact that the sum of independent normal
random variables is normal (see Chapter 4).
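The repeated-experiment interpretation lends itself to a simulation. The sketch below is an illustration under stated assumptions: the function names, the parameter values, the number of trials, and the seed are hypothetical. It builds the known-variance interval of Example 9.6 many times and reports the fraction of intervals that contain the fixed parameter.

```python
import math
import random
from statistics import NormalDist


def known_variance_interval(xs, v, alpha=0.05):
    """1 - alpha confidence interval for the mean of normal data with known variance v."""
    n = len(xs)
    theta_hat = sum(xs) / n
    z = NormalDist().inv_cdf(1 - alpha / 2)   # about 1.96 when alpha = 0.05
    half_width = z * math.sqrt(v / n)
    return theta_hat - half_width, theta_hat + half_width


def coverage_fraction(theta, v, n, trials, seed=0):
    """Fraction of independently constructed intervals that contain theta."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(theta, math.sqrt(v)) for _ in range(n)]
        lower, upper = known_variance_interval(xs, v)
        hits += lower <= theta <= upper
    return hits / trials


# Hypothetical setting: theta = 1.5, v = 4, n = 25 observations per experiment.
print(coverage_fraction(theta=1.5, v=4.0, n=25, trials=4000))  # near 0.95
```

Changing `theta` should leave the reported fraction near 0.95, which is the sense in which the guarantee holds for every possible value of the parameter.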


Confidence Intervals

• A confidence interval for a scalar unknown parameter $\theta$ is an interval
  whose endpoints $\hat{\Theta}_n^-$ and $\hat{\Theta}_n^+$ bracket $\theta$ with a given high probability.

• $\hat{\Theta}_n^-$ and $\hat{\Theta}_n^+$ are random variables that depend on the observations
  $X_1, \ldots, X_n$.

• A $1 - \alpha$ confidence interval is one that satisfies

  $$P_\theta\big( \hat{\Theta}_n^- \le \theta \le \hat{\Theta}_n^+ \big) \ge 1 - \alpha,$$

  for all possible values of $\theta$.





Confidence intervals are usually constructed by forming an interval around
an estimator $\hat{\Theta}_n$ (cf. Example 9.6). Furthermore, out of a variety of possible
confidence intervals, one with the smallest possible width is usually desirable.
However, this construction is sometimes hard because the distribution of the
error $\hat{\Theta}_n - \theta$ is either unknown or depends on $\theta$. Fortunately, for many important
models, $\hat{\Theta}_n - \theta$ is asymptotically normal and asymptotically unbiased. By this
we mean that the CDF of the random variable

$$\frac{\hat{\Theta}_n - \theta}{\sqrt{\mathrm{var}_\theta(\hat{\Theta}_n)}}$$

approaches the standard normal CDF as $n$ increases, for every value of $\theta$. We may
then proceed exactly as in our earlier normal example (Example 9.6), provided
that $\mathrm{var}_\theta(\hat{\Theta}_n)$ is known or can be approximated, as we now discuss.


Confidence Intervals Based on Estimator Variance Approximations 


Suppose that the observations $X_i$ are i.i.d. with mean $\theta$ and variance $v$ that are
unknown. We may estimate $\theta$ with the sample mean

$$\hat{\Theta}_n = \frac{X_1 + \cdots + X_n}{n}$$

and estimate $v$ with the unbiased estimator

$$\hat{S}_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \hat{\Theta}_n)^2$$

that we introduced earlier. In particular, we may estimate the variance $v/n$ of
the sample mean by $\hat{S}_n^2/n$. Then, for a given $\alpha$, we may use these estimates
and the central limit theorem to construct an (approximate) $1 - \alpha$ confidence
interval. This is the interval

$$\left[ \hat{\Theta}_n - z \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + z \frac{\hat{S}_n}{\sqrt{n}} \right],$$

where $z$ is obtained from the relation

$$\Phi(z) = 1 - \frac{\alpha}{2}$$

and the normal tables, and where $\hat{S}_n$ is the positive square root of $\hat{S}_n^2$. For
instance, if $\alpha = 0.05$, we use the fact $\Phi(1.96) = 0.975 = 1 - \alpha/2$ (from the
normal tables) and obtain an approximate 95% confidence interval of the form

$$\left[ \hat{\Theta}_n - 1.96 \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + 1.96 \frac{\hat{S}_n}{\sqrt{n}} \right].$$

Note that in this approach, there are two different approximations in effect.
First, we are treating $\hat{\Theta}_n$ as if it were a normal random variable; second, we are
replacing the true variance $v/n$ of $\hat{\Theta}_n$ by its estimate $\hat{S}_n^2/n$.

Even in the special case where the $X_i$ are normal random variables, the
confidence interval produced by the preceding procedure is still approximate.
The reason is that $\hat{S}_n^2$ is only an approximation to the true variance $v$, and the
random variable

$$T_n = \frac{\sqrt{n}\,(\hat{\Theta}_n - \theta)}{\hat{S}_n}$$

is not normal. However, for normal $X_i$, it can be shown that the PDF of $T_n$
does not depend on $\theta$ and $v$, and can be computed explicitly. It is called the
t-distribution with $n - 1$ degrees of freedom.† Like the standard normal
PDF, it is symmetric and bell-shaped, but it is a little more spread out and has
heavier tails (see Fig. 9.3). The probabilities of various intervals of interest are
available in tables, similar to the normal tables. Thus, when the $X_i$ are normal
(or nearly normal) and $n$ is relatively small, a more accurate confidence interval
is of the form

$$\left[ \hat{\Theta}_n - z \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + z \frac{\hat{S}_n}{\sqrt{n}} \right],$$

where $z$ is obtained from the relation

$$\Psi_{n-1}(z) = 1 - \frac{\alpha}{2},$$

and $\Psi_{n-1}(z)$ is the CDF of the t-distribution with $n - 1$ degrees of freedom,
available in tables. These tables may be found in many sources. An abbreviated
version is given in the table that accompanies Example 9.7.

† The t-distribution has interesting properties and can be expressed in closed
form, but the precise formula is not important for our purposes. Sometimes it is called
the "Student's distribution." It was published in 1908 by William Gosset while he was
employed at a brewery in Dublin. He wrote his paper under the pseudonym Student
because he was prohibited from publishing under his own name. Gosset was concerned
with the selection of the best yielding varieties of barley and had to work with small
sample sizes.


[Plot: probability density functions of $N(0,1)$ and of the t-distribution for
$n = 11$, $n = 3$, and $n = 2$.]

Figure 9.3. The PDF of the t-distribution with $n - 1$ degrees of freedom in
comparison with the standard normal PDF.


On the other hand, when n is moderately large (e.g., n > 50), the t- 
distribution is very close to the normal distribution, and the normal tables can 
be used. 


Example 9.7. The weight of an object is measured eight times using an electronic 
scale that reports the true weight plus a random error that is normally distributed 
with zero mean and unknown variance. Assume that the errors in the observations 
are independent. The following results are obtained: 


0.5547, 0.5404, 0.6364, 0.6438, 0.4917, 0.5674, 0.5564, 0.6066. 
We compute a 95% confidence interval ($\alpha = 0.05$) using the t-distribution. The
value of the sample mean $\hat{\Theta}_n$ is 0.5747, and the estimated variance of $\hat{\Theta}_n$ is

$$\frac{\hat{S}_n^2}{n} = \frac{1}{n(n - 1)} \sum_{i=1}^{n} (x_i - \hat{\Theta}_n)^2 = 3.2952 \cdot 10^{-4},$$

so that $\hat{S}_n/\sqrt{n} = 0.0182$. From the t-distribution tables, we obtain $1 - \Psi_7(2.365) =
0.025 = \alpha/2$, so that

$$P_\theta\left( \frac{|\hat{\Theta}_n - \theta|}{\hat{S}_n/\sqrt{n}} \le 2.365 \right) = 0.95.$$

Thus,

$$\left[ \hat{\Theta}_n - 2.365 \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + 2.365 \frac{\hat{S}_n}{\sqrt{n}} \right] = [0.531,\ 0.618]$$

is a 95% confidence interval. It is interesting to compare it with the confidence
interval

$$\left[ \hat{\Theta}_n - 1.96 \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + 1.96 \frac{\hat{S}_n}{\sqrt{n}} \right] = [0.539,\ 0.610]$$

obtained from the normal tables, which is narrower and therefore more optimistic
about the precision of the estimate $\hat{\theta}_n = 0.5747$.

    n - 1 |  0.100   0.050   0.025   0.010   0.005   0.001
    ------+------------------------------------------------
      1   |  3.078   6.314   12.71   31.82   63.66   318.3
      ⋮   |

The t-tables for the CDF $\Psi_{n-1}(z)$ of the t-distribution with a given number
$n - 1$ of degrees of freedom. The entries in this table are:

Left column: Number of degrees of freedom $n - 1$.
Top row: A desired tail probability $\beta$.
Entries under the top row: A value $z$ such that $\Psi_{n-1}(z) = 1 - \beta$.
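The arithmetic of Example 9.7 can be reproduced with a few lines of code. This is a minimal sketch with hypothetical function names; the critical values 2.365 (t-distribution, 7 degrees of freedom) and 1.96 (standard normal) are taken from the example rather than computed.

```python
import math

# The eight measurements from Example 9.7.
weights = [0.5547, 0.5404, 0.6364, 0.6438, 0.4917, 0.5674, 0.5564, 0.6066]


def mean_and_standard_error(xs):
    """Sample mean and the estimated standard deviation of the sample mean."""
    n = len(xs)
    theta_hat = sum(xs) / n
    s_hat_sq = sum((x - theta_hat) ** 2 for x in xs) / (n - 1)   # unbiased variance
    return theta_hat, math.sqrt(s_hat_sq / n)


def interval(theta_hat, std_err, z):
    """Symmetric interval theta_hat +/- z * std_err."""
    return theta_hat - z * std_err, theta_hat + z * std_err


theta_hat, std_err = mean_and_standard_error(weights)
t_interval = interval(theta_hat, std_err, 2.365)      # t-table value, 7 degrees of freedom
normal_interval = interval(theta_hat, std_err, 1.96)  # normal-table value

print(theta_hat, std_err ** 2)
print(t_interval)
print(normal_interval)
```

The printed endpoints should agree with the intervals reported in the example, up to small differences in the last digit that can arise from rounding of intermediate values.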
The approximate confidence intervals constructed so far relied on the particular
estimator $\hat{S}_n^2$ for the unknown variance $v$. However, different estimators
or approximations of the variance are possible. For example, suppose that the
observations $X_1, \ldots, X_n$ are i.i.d. Bernoulli with unknown mean $\theta$, and variance
$v = \theta(1 - \theta)$. Then, instead of $\hat{S}_n^2$, the variance could be approximated
by $\hat{\Theta}_n(1 - \hat{\Theta}_n)$. Indeed, as $n$ increases, $\hat{\Theta}_n$ converges to $\theta$, in probability, from
which it can be shown that $\hat{\Theta}_n(1 - \hat{\Theta}_n)$ converges to $v$. Another possibility is to
just observe that $\theta(1 - \theta) \le 1/4$ for all $\theta \in [0, 1]$, and use 1/4 as a conservative
estimate of the variance. The following example illustrates these alternatives.


Example 9.8. Polling. Consider the polling problem of Section 5.4 (Example
5.11), where we wish to estimate the fraction $\theta$ of voters who support a particular
candidate for office. We collect $n$ independent sample voter responses $X_1, \ldots, X_n$,
where $X_i$ is viewed as a Bernoulli random variable, with $X_i = 1$ if the $i$th voter
supports the candidate. We estimate $\theta$ with the sample mean $\hat{\Theta}_n$, and construct a
confidence interval based on a normal approximation and different ways of estimating
or approximating the unknown variance. For concreteness, suppose that 684 out of
a sample of $n = 1200$ voters support the candidate, so that $\hat{\Theta}_n = 684/1200 = 0.57$.


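(a) If we use the variance estimate

$$\hat{S}_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \hat{\Theta}_n)^2
= \frac{1}{1199} \left( 684 \cdot \left( 1 - \frac{684}{1200} \right)^2 + (1200 - 684) \cdot \left( 0 - \frac{684}{1200} \right)^2 \right)
\approx 0.245,$$

and treat $\hat{\Theta}_n$ as a normal random variable with mean $\theta$ and variance $0.245/n$,
we obtain the 95% confidence interval

$$\left[ \hat{\Theta}_n - 1.96 \frac{\hat{S}_n}{\sqrt{n}},\ \hat{\Theta}_n + 1.96 \frac{\hat{S}_n}{\sqrt{n}} \right]
= \left[ 0.57 - \frac{1.96 \cdot \sqrt{0.245}}{\sqrt{1200}},\ 0.57 + \frac{1.96 \cdot \sqrt{0.245}}{\sqrt{1200}} \right]
= [0.542,\ 0.598].$$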
(b) The variance estimate

$$\hat{\Theta}_n (1 - \hat{\Theta}_n) = \frac{684}{1200} \cdot \left( 1 - \frac{684}{1200} \right) = 0.245$$

is the same as the previous one (up to three decimal place accuracy), and the
resulting 95% confidence interval

$$\left[ \hat{\Theta}_n - 1.96 \frac{\sqrt{\hat{\Theta}_n (1 - \hat{\Theta}_n)}}{\sqrt{n}},\ \hat{\Theta}_n + 1.96 \frac{\sqrt{\hat{\Theta}_n (1 - \hat{\Theta}_n)}}{\sqrt{n}} \right]$$

is again $[0.542, 0.598]$.

(c) The conservative upper bound of 1/4 for the variance results in the confidence
interval

$$\left[ \hat{\Theta}_n - 1.96 \frac{1/2}{\sqrt{n}},\ \hat{\Theta}_n + 1.96 \frac{1/2}{\sqrt{n}} \right]
= \left[ 0.57 - \frac{1.96 \cdot (1/2)}{\sqrt{1200}},\ 0.57 + \frac{1.96 \cdot (1/2)}{\sqrt{1200}} \right]
= [0.542,\ 0.599],$$

which is only slightly wider, but practically the same as before.


Figure 9.4 illustrates the confidence intervals obtained using methods (b)
and (c), for a fixed value $\hat{\Theta}_n = 0.57$ and a range of sample sizes from $n = 10$ to
$n = 10{,}000$. We see that when $n$ is in the hundreds, as is typically the case in
voter polling, the difference is slight. On the other hand, for small values of $n$,
the different approaches result in fairly different confidence intervals, and therefore
some care is required.
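The three variance approximations of Example 9.8 can be placed side by side in code. The sketch below is illustrative; the function name and the dictionary keys are hypothetical, and 1.96 is the normal-table value used in the example.

```python
import math


def polling_intervals(successes, n, z=1.96):
    """Approximate confidence intervals for a Bernoulli mean, using three
    variance approximations: (a) unbiased sample variance, (b) plug-in
    theta_hat * (1 - theta_hat), (c) the conservative bound 1/4."""
    theta_hat = successes / n
    # (a) sample variance of 0/1 data: `successes` ones and `n - successes` zeros
    s_hat_sq = (successes * (1 - theta_hat) ** 2
                + (n - successes) * (0 - theta_hat) ** 2) / (n - 1)
    variances = {
        "a": s_hat_sq,
        "b": theta_hat * (1 - theta_hat),
        "c": 0.25,
    }
    result = {}
    for method, variance in variances.items():
        half_width = z * math.sqrt(variance / n)
        result[method] = (theta_hat - half_width, theta_hat + half_width)
    return result


for method, (lower, upper) in polling_intervals(684, 1200).items():
    print(method, lower, upper)

# With a small sample and the same sample fraction, the methods differ more.
for method, (lower, upper) in polling_intervals(57, 100).items():
    print(method, lower, upper)
```

Method (c) always yields the widest interval, since 1/4 is an upper bound on the variance of a Bernoulli random variable.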


[Plot: horizontal axis $n$, on a logarithmic scale.]

Figure 9.4. The distance of the confidence interval endpoints from $\hat{\Theta}_n$ for
methods (b) and (c) of approximating the variance in the polling Example 9.8, when
$\hat{\Theta}_n = 0.57$ and for a range of sample sizes from $n = 10$ to $n = 10{,}000$.


9.2 LINEAR REGRESSION 


In this section, we develop the linear regression methodology for building a model 
of the relation between two or more variables of interest on the basis of available 
data. An interesting feature of this methodology is that it may be explained and 
developed simply as a least squares approximation procedure, without any
probabilistic assumptions. Yet, the linear regression formulas may also be interpreted
in the context of various probabilistic frameworks, which provide perspective and
a mechanism for quantitative analysis. 

We first consider the case of only two variables, and then generalize. We
wish to model the relation between two variables of interest, $x$ and $y$ (e.g., years
of education and income), based on a collection of data pairs $(x_i, y_i)$, $i = 1, \ldots, n$.
For example, $x_i$ could be the years of education and $y_i$ the annual income of the
$i$th person in the sample. Often a two-dimensional plot of these samples indicates
a systematic, approximately linear relation between $x_i$ and $y_i$. Then, it is natural
to attempt to build a linear model of the form

$$y \approx \theta_0 + \theta_1 x,$$

where $\theta_0$ and $\theta_1$ are unknown parameters to be estimated.

In particular, given some estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ of the resulting parameters,
the value $y_i$ corresponding to $x_i$, as predicted by the model, is

$$\hat{y}_i = \hat{\theta}_0 + \hat{\theta}_1 x_i.$$

Generally, $\hat{y}_i$ will be different from the given value $y_i$, and the corresponding
difference

$$\tilde{y}_i = y_i - \hat{y}_i,$$

is called the $i$th residual. A choice of estimates that results in small residuals
is considered to provide a good fit to the data. With this motivation, the linear
regression approach chooses the parameter estimates $\hat{\theta}_0$ and $\hat{\theta}_1$ that minimize
the sum of the squared residuals,

$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2,$$

over all $\theta_0$ and $\theta_1$; see Fig. 9.5 for an illustration.


[Plot: data pairs $(x_i, y_i)$ scattered around a fitted line; the vertical distance
from a point to the line is labeled as the residual $y_i - \hat{\theta}_0 - \hat{\theta}_1 x_i$.]

Figure 9.5: Illustration of a set of data pairs $(x_i, y_i)$, and a linear model $y =
\hat{\theta}_0 + \hat{\theta}_1 x$, obtained by minimizing over $\theta_0, \theta_1$ the sum of the squares of the residuals
$y_i - \theta_0 - \theta_1 x_i$.

Note that the postulated linear model may or may not be true. For example,
the true relation between the two variables may be nonlinear. The linear
least squares approach aims at finding the best possible linear model, and
involves an implicit hypothesis that a linear model is valid. In practice, there is
often an additional phase where we examine whether the hypothesis of a linear
model is supported by the data and try to validate the estimated model.

To derive the formulas for the linear regression estimates $\hat{\theta}_0$ and $\hat{\theta}_1$, we
observe that once the data are given, the sum of the squared residuals is a
quadratic function of $\theta_0$ and $\theta_1$. To perform the minimization, we set to zero the
partial derivatives with respect to $\theta_0$ and $\theta_1$. We obtain two linear equations in
$\theta_0$ and $\theta_1$, which can be solved explicitly. After some algebra, we find that the
solution has a simple and appealing form, summarized below.


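Linear Regression

Given $n$ data pairs $(x_i, y_i)$, the estimates that minimize the sum of the
squared residuals are given by

$$\hat{\theta}_1 = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x},$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i.$$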
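The algebra behind the regression formulas can be sketched as follows, writing $\bar{x}$ and $\bar{y}$ for the averages of the $x_i$ and the $y_i$. Setting to zero the partial derivatives of $\sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i)^2$ with respect to $\theta_0$ and $\theta_1$ gives

$$\sum_{i=1}^{n} (y_i - \theta_0 - \theta_1 x_i) = 0, \qquad \sum_{i=1}^{n} x_i\, (y_i - \theta_0 - \theta_1 x_i) = 0.$$

The first equation yields $\theta_0 = \bar{y} - \theta_1 \bar{x}$. Substituting into the second, and using the fact that $\sum_{i=1}^{n} \big( (y_i - \bar{y}) - \theta_1 (x_i - \bar{x}) \big) = 0$ to replace the factor $x_i$ by $x_i - \bar{x}$, we obtain

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \theta_1 \sum_{i=1}^{n} (x_i - \bar{x})^2,$$

which gives the stated expression for the slope estimate whenever the $x_i$ are not all equal.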


Example 9.9. The leaning tower of Pisa continuously tilts over time. Measurements
between years 1975 and 1987 of the "lean" of a fixed point on the tower (the
distance in meters between the actual position of the point and its position if the tower
were straight) have produced the following table.

    Year | 1975   | 1976   | 1977   | 1978   | 1979   | 1980   | 1981
    Lean | 2.9642 | 2.9644 | 2.9656 | 2.9667 | 2.9673 | 2.9688 | 2.9696

    Year | 1982   | 1983   | 1984   | 1985   | 1986   | 1987
    Lean | 2.9698 | 2.9713 | 2.9717 | 2.9725 | 2.9742 | 2.9757
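Let us use linear regression to estimate the parameters $\theta_0$ and $\theta_1$ in a model of the
form $y = \theta_0 + \theta_1 x$, where $x$ is the year and $y$ is the lean. Using the regression
formulas, we obtain

$$\hat{\theta}_1 = \frac{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\displaystyle\sum_{i=1}^{n} (x_i - \bar{x})^2} = 0.0009, \qquad \hat{\theta}_0 = \bar{y} - \hat{\theta}_1 \bar{x} = 1.1233,$$

where

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i = 1981, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = 2.9694.$$

The estimated linear model is

$$y = 0.0009x + 1.1233,$$

and is illustrated in Figure 9.6.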
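The regression formulas are straightforward to apply in code. The following sketch uses a hypothetical function name and the data of Example 9.9; it implements the two formulas directly, without any library routine.

```python
def linear_regression(xs, ys):
    """Least squares estimates (theta0, theta1) for the model y = theta0 + theta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = s_xy / s_xx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1


# Data from Example 9.9: lean of the tower of Pisa, 1975 through 1987.
years = list(range(1975, 1988))
lean = [2.9642, 2.9644, 2.9656, 2.9667, 2.9673, 2.9688, 2.9696,
        2.9698, 2.9713, 2.9717, 2.9725, 2.9742, 2.9757]

theta0, theta1 = linear_regression(years, lean)
print(round(theta1, 4), round(theta0, 4))  # 0.0009 1.1233
```

The slope of about 0.0009 corresponds to an estimated increase in lean of a little under one millimeter per year over this period.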





[Plot: data points and the estimated linear model; horizontal axis: Year, from
1974 to 1988.]

Figure 9.6: The data and the estimated linear model for the lean of the tower
of Pisa (Example 9.9).


Justification of the Least Squares Formulation†

The least squares formulation can be justified on the basis of probabilistic
considerations in several different ways, based on different sets of assumptions.

† This subsection can be skipped without loss of continuity.

(a) Maximum likelihood (linear model, normal noise). We assume that
the $x_i$ are given numbers (not random variables). We assume that $y_i$ is the
realization of a random variable $Y_i$, generated according to a model of the
form

$$Y_i = \theta_0 + \theta_1 x_i + W_i, \qquad i = 1, \ldots, n,$$

where the $W_i$ are i.i.d. normal random variables with mean zero and variance
$\sigma^2$. It follows that the $Y_i$ are independent normal variables, where $Y_i$
has mean $\theta_0 + \theta_1 x_i$ and variance $\sigma^2$. The likelihood function takes the form

$$f_Y(y; \theta) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\, \sigma} \exp\left\{ -\frac{(y_i - \theta_0 - \theta_1 x_i)^2}{2\sigma^2} \right\}.$$

Maximizing the likelihood function is the same as maximizing the exponent
in the above expression, which is equivalent to minimizing the sum of the
squared residuals. Thus, the linear regression estimates can be viewed as
ML estimates within a suitable linear/normal context. In fact they can be
shown to be unbiased estimates in this context. Furthermore, the variances
of the estimates can be calculated using convenient formulas (see the
end-of-chapter problems), and then used to construct confidence intervals using
the methodology of Section 9.1.


(b) Approximate Bayesian linear LMS estimation (under a possibly
nonlinear model). Suppose now that both $x_i$ and $y_i$ are realizations of
random variables $X_i$ and $Y_i$. The different pairs $(X_i, Y_i)$ are i.i.d., but
with unknown joint distribution. Consider an additional independent pair
$(X_0, Y_0)$, with the same joint distribution. Suppose we observe $X_0$ and
wish to estimate $Y_0$ using a linear estimator of the form $\hat{Y}_0 = \theta_0 + \theta_1 X_0$.
We know from Section 8.4 that the linear LMS estimator of $Y_0$, given $X_0$,
is of the form

$$E[Y_0] + \frac{\mathrm{cov}(X_0, Y_0)}{\mathrm{var}(X_0)} \big( X_0 - E[X_0] \big),$$

yielding

$$\theta_1 = \frac{\mathrm{cov}(X_0, Y_0)}{\mathrm{var}(X_0)}, \qquad \theta_0 = E[Y_0] - \theta_1 E[X_0].$$

Since we do not know the distribution of $(X_0, Y_0)$, we use $\bar{x}$ as an estimate
of $E[X_0]$, $\bar{y}$ as an estimate of $E[Y_0]$, $\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})/n$ as an estimate
of $\mathrm{cov}(X_0, Y_0)$, and $\sum_{i=1}^{n} (x_i - \bar{x})^2/n$ as an estimate of $\mathrm{var}(X_0)$. By
substituting these estimates into the above formulas for $\theta_0$ and $\theta_1$, we recover
the expressions for the linear regression parameter estimates given earlier.
Note that this argument does not assume that a linear model is valid.


(c) Approximate Bayesian LMS estimation (linear model). Let the
pairs $(X_i, Y_i)$ be random and i.i.d. as in (b) above. Let us also make the
additional assumption that the pairs satisfy a linear model of the form

$$Y_i = \theta_0 + \theta_1 X_i + W_i,$$

where the $W_i$ are i.i.d., zero mean noise terms, independent of the $X_i$. From
the least mean squares property of conditional expectations, we know that
$E[Y_0 \mid X_0]$ minimizes the mean squared estimation error $E\big[(Y_0 - g(X_0))^2\big]$,
over all functions $g$. Under our assumptions, $E[Y_0 \mid X_0] = \theta_0 + \theta_1 X_0$. Thus,
the true parameters $\theta_0$ and $\theta_1$ minimize

$$E\big[ (Y_0 - \theta_0' - \theta_1' X_0)^2 \big]$$

over all $\theta_0'$ and $\theta_1'$. By the weak law of large numbers, this expression is the
limit as $n \to \infty$ of

$$\frac{1}{n} \sum_{i=1}^{n} (Y_i - \theta_0' - \theta_1' X_i)^2.$$

This indicates that we will obtain a good approximation of the minimizers
of $E\big[(Y_0 - \theta_0' - \theta_1' X_0)^2\big]$ (the true parameters), by minimizing the above
expression (with $X_i$ and $Y_i$ replaced by their observed values $x_i$ and $y_i$,
respectively). But minimizing this expression is the same as minimizing
the sum of the squared residuals.


Bayesian Linear Regression†


Linear models and regression are not exclusively tied to classical inference
methods. They can also be studied within a Bayesian framework, as we now explain.
In particular, we may model $x_1, \ldots, x_n$ as given numbers, and $y_1, \ldots, y_n$ as the
observed values of a vector $Y = (Y_1, \ldots, Y_n)$ of random variables that obey a
linear relation

$$Y_i = \Theta_0 + \Theta_1 x_i + W_i.$$

Here, $\Theta = (\Theta_0, \Theta_1)$ is the parameter to be estimated, and $W_1, \ldots, W_n$ are i.i.d.
random variables with mean zero and known variance $\sigma^2$. Consistent with the
Bayesian philosophy, we model $\Theta_0$ and $\Theta_1$ as random variables. We assume
that $\Theta_0, \Theta_1, W_1, \ldots, W_n$ are independent, and that $\Theta_0$, $\Theta_1$ have mean zero and
variances $\sigma_0^2$, $\sigma_1^2$, respectively.

We may now derive a Bayesian estimator based on the MAP approach and
the assumption that $\Theta_0$, $\Theta_1$, and $W_1, \ldots, W_n$ are normal random variables. We
maximize over $\theta_0$, $\theta_1$ the posterior PDF $f_{\Theta|Y}(\theta_0, \theta_1 \mid y_1, \ldots, y_n)$. By Bayes' rule,
the posterior PDF is‡

$$f_\Theta(\theta_0, \theta_1)\, f_{Y|\Theta}(y_1, \ldots, y_n \mid \theta_0, \theta_1),$$

divided by a positive normalizing constant that does not depend on $(\theta_0, \theta_1)$.
Under our normality assumptions, this expression can be written as

$$c \cdot \exp\Big\{-\frac{\theta_0^2}{2\sigma_0^2}\Big\} \cdot \exp\Big\{-\frac{\theta_1^2}{2\sigma_1^2}\Big\} \cdot \prod_{i=1}^n \exp\Big\{-\frac{(y_i - \theta_0 - x_i \theta_1)^2}{2\sigma^2}\Big\},$$

where $c$ is a normalizing constant that does not depend on $(\theta_0, \theta_1)$. Equivalently,
we minimize over $\theta_0$ and $\theta_1$ the expression

$$\frac{\theta_0^2}{2\sigma_0^2} + \frac{\theta_1^2}{2\sigma_1^2} + \sum_{i=1}^n \frac{(y_i - \theta_0 - x_i \theta_1)^2}{2\sigma^2}.$$

Note the similarity with the expression $\sum_{i=1}^n (y_i - \theta_0 - x_i \theta_1)^2$, which is minimized
in the earlier classical linear regression formulation. (The two minimizations
would be identical if $\sigma_0$ and $\sigma_1$ were so large that the terms $\theta_0^2/2\sigma_0^2$ and $\theta_1^2/2\sigma_1^2$
could be neglected.) The minimization is carried out by setting to zero the
partial derivatives with respect to $\theta_0$ and $\theta_1$. After some algebra, we obtain the
following solution.

† This subsection can be skipped without loss of continuity.
‡ Note that in this paragraph, we use conditional probability notation since we
are dealing with a Bayesian framework.


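Bayesian Linear Regression

• Model:
  (a) We assume a linear relation $Y_i = \Theta_0 + \Theta_1 x_i + W_i$.
  (b) The $x_i$ are modeled as known constants.
  (c) The random variables $\Theta_0, \Theta_1, W_1, \ldots, W_n$ are normal and independent.
  (d) The random variables $\Theta_0$ and $\Theta_1$ have mean zero and variances
      $\sigma_0^2$, $\sigma_1^2$, respectively.
  (e) The random variables $W_i$ have mean zero and variance $\sigma^2$.

• Estimation Formulas:
  Given the data pairs $(x_i, y_i)$, the MAP estimates of $\Theta_0$ and $\Theta_1$ are

  $$\hat\theta_1 = \frac{\sigma_1^2}{\sigma^2 + \sigma_1^2 \sum_{i=1}^n (x_i - \bar{x})^2} \cdot \sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y}),$$

  $$\hat\theta_0 = \frac{n \sigma_0^2}{\sigma^2 + n \sigma_0^2}\, (\bar{y} - \hat\theta_1 \bar{x}),$$

  where

  $$\bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, \qquad \bar{y} = \frac{1}{n} \sum_{i=1}^n y_i.$$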


We make a few remarks: 


(a) If $\sigma^2$ is very large compared to $\sigma_0^2$ and $\sigma_1^2$, we obtain $\hat\theta_0 \approx 0$ and $\hat\theta_1 \approx 0$.
What is happening here is that the observations are too noisy and are
essentially ignored, so that the estimates become the same as the prior
means, which we assumed to be zero.

(b) If we let the prior variances $\sigma_0^2$ and $\sigma_1^2$ increase to infinity, we are indicating
the absence of any useful prior information on $\Theta_0$ and $\Theta_1$. In this case,
the MAP estimates become independent of $\sigma^2$, and they agree with the
classical linear regression formulas that we derived earlier.

(c) Suppose, for simplicity, that $\bar{x} = 0$. When estimating $\Theta_1$, the values $y_i$ of
the observations $Y_i$ are weighted in proportion to the associated values $x_i$.
This is intuitive: when $x_i$ is large, the contribution of $\Theta_1 x_i$ to $Y_i$ is relatively
large, and therefore $Y_i$ contains useful information on $\Theta_1$. Conversely, if $x_i$
is zero, the observation $Y_i$ is independent of $\Theta_1$ and can be ignored.

(d) The estimates $\hat\theta_0$ and $\hat\theta_1$ are linear functions of the $y_i$, but not of the $x_i$.
Recall, however, that the $x_i$ are treated as exogenous, non-random quantities,
whereas the $y_i$ are observed values of the random variables $Y_i$. Thus
the MAP estimators $\hat\Theta_0$, $\hat\Theta_1$ are linear estimators, in the sense defined in
Section 8.4. It follows, in view of our normality assumptions, that the
estimators are also Bayesian linear LMS estimators as well as LMS estimators
(cf. the discussion near the end of Section 8.4).
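
Remarks (a) and (b) can be explored numerically. The sketch below is an illustration under stated assumptions, not the text's own procedure: it obtains the MAP estimates by solving the two first-order conditions of the minimization directly (a 2x2 linear system, solved by Cramer's rule) rather than by evaluating a closed-form expression. The data values and the names `classical_fit` and `map_fit` are hypothetical.

```python
def classical_fit(xs, ys):
    """Classical least squares estimates (theta0_hat, theta1_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


def map_fit(xs, ys, var_w, var0, var1):
    """MAP estimates obtained from the two first-order conditions.

    Setting to zero the partial derivatives of
        theta0^2/(2 var0) + theta1^2/(2 var1)
        + sum_i (y_i - theta0 - x_i theta1)^2 / (2 var_w)
    yields a 2x2 linear system, solved here by Cramer's rule.
    """
    n = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    a11 = n + var_w / var0
    a12 = sx
    a22 = sxx + var_w / var1
    det = a11 * a22 - a12 * a12
    theta0 = (sy * a22 - a12 * sxy) / det
    theta1 = (a11 * sxy - a12 * sy) / det
    return theta0, theta1


# Illustrative data (assumed for this sketch).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-3.1, -0.9, 1.2, 2.8, 5.0]

print("classical:         ", classical_fit(xs, ys))
print("vague prior:       ", map_fit(xs, ys, var_w=1.0, var0=1e12, var1=1e12))
print("very noisy data:   ", map_fit(xs, ys, var_w=1e9, var0=1.0, var1=1.0))
```

With very large prior variances the MAP estimates should coincide with the classical ones, and with a very large noise variance they should shrink toward the zero prior means.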


Multiple Linear Regression 


Our discussion of linear regression so far involved a single explanatory
variable, namely $x$, a special case known as simple regression. The objective was
to build a model that explains the observed values $y_i$ on the basis of the
values $x_i$. Many phenomena, however, involve multiple underlying or explanatory
variables. (For example, we may consider a model that tries to explain annual
income as a function of both age and years of education.) Models of this type
are called multiple regression models.

For instance, suppose that our data consist of triples of the form $(x_i, y_i, z_i)$
and that we wish to estimate the parameters $\theta_j$ of a model of the form

$$y \approx \theta_0 + \theta_1 x + \theta_2 z.$$

As an example, $y_i$ may be the income, $x_i$ the age, and $z_i$ the years of education
of the $i$th person in a random sample. We then seek to minimize the sum of the
squared residuals

$$\sum_{i=1}^n (y_i - \theta_0 - \theta_1 x_i - \theta_2 z_i)^2$$

over all $\theta_0$, $\theta_1$, and $\theta_2$. More generally, there is no limit on the number of
explanatory variables to be employed. The calculation of the regression estimates
$\hat\theta_j$ is conceptually the same as for the case of a single explanatory variable, but
of course the formulas are more complicated.

As a special case, suppose that $z_i = x_i^2$, in which case we are dealing with
a model of the form

$$y \approx \theta_0 + \theta_1 x + \theta_2 x^2.$$

Such a model would be appropriate if there are good reasons to expect a quadratic
dependence of $y_i$ on $x_i$. (Of course, higher order polynomial models are also
possible.) While such a quadratic dependence is nonlinear, the underlying model is
still said to be linear, in the sense that the unknown parameters $\theta_j$ are linearly
related to the observed random variables $Y_i$. More generally, we may consider a
model of the form

$$y \approx \theta_0 + \sum_{j=1}^m \theta_j h_j(x),$$

where the $h_j$ are functions that capture the general form of the anticipated
dependence of $y$ on $x$. We may then obtain parameters $\hat\theta_0, \hat\theta_1, \ldots, \hat\theta_m$ by minimizing
over $\theta_0, \theta_1, \ldots, \theta_m$ the expression

$$\sum_{i=1}^n \Big(y_i - \theta_0 - \sum_{j=1}^m \theta_j h_j(x_i)\Big)^2.$$

This minimization problem is known to admit closed form as well as efficient
numerical solutions.
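
One way to carry out this minimization is through the normal equations, which are linear in the unknown parameters. The following is a minimal sketch under assumptions of our own: the helper names, the Gaussian elimination routine, and the noise-free quadratic data are illustrative, and a numerical library would normally be used instead of a hand-written solver.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_basis_model(xs, ys, basis):
    """Least squares fit of y ~ theta_0 + sum_j theta_j * h_j(x).

    Returns [theta_0_hat, theta_1_hat, ..., theta_m_hat].
    """
    rows = [[1.0] + [h(x) for h in basis] for x in xs]
    k = len(basis) + 1
    # Normal equations: (A^T A) theta = A^T y, where A has the rows above.
    gram = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    rhs = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    return solve_linear_system(gram, rhs)


# Quadratic special case: h_1(x) = x, h_2(x) = x^2, with noise-free data.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0 + 2.0 * x + 0.5 * x * x for x in xs]
theta_hat = fit_basis_model(xs, ys, [lambda x: x, lambda x: x * x])
print(theta_hat)
```

Because the model is linear in the parameters, the same routine handles any choice of the functions $h_j$; only the list passed as `basis` changes.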


Nonlinear Regression 


There are nonlinear extensions of the linear regression methodology to situations
where the assumed model structure is nonlinear in the unknown parameters. In
particular, we assume that the variables $x$ and $y$ obey a relation of the form

$$y \approx h(x; \theta),$$

where $h$ is a given function and $\theta$ is a parameter to be estimated. We are given
data pairs $(x_i, y_i)$, $i = 1, \ldots, n$, and we seek a value of $\theta$ that minimizes the sum
of the squared residuals

$$\sum_{i=1}^n \big(y_i - h(x_i; \theta)\big)^2.$$

Unlike linear regression, this minimization problem does not admit, in general,
a closed form solution. However, fairly efficient computational methods are
available for solving it in practice. Similar to linear regression, nonlinear least
squares can be motivated as ML estimation with a model of the form

$$Y_i = h(x_i; \theta) + W_i,$$

where the $W_i$ are i.i.d. normal random variables with zero mean. To see this,
note that the likelihood function takes the form

$$f_Y(y; \theta) = \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{(y_i - h(x_i; \theta))^2}{2\sigma^2}\Big\},$$

where $\sigma^2$ is the variance of $W_i$. Maximizing the likelihood function is the same
as maximizing the exponent in the above expression, which is equivalent to
minimizing the sum of the squared residuals.
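
As an illustration of such a computational method, the sketch below applies a Gauss-Newton iteration to a scalar parameter. This is one common technique chosen for the example, not a method prescribed by the text. The model $h(x; \theta) = e^{\theta x}$, the data, the starting point, and the function names are all assumptions, and convergence is not guaranteed for arbitrary models or starting points.

```python
import math


def gauss_newton(xs, ys, h, dh_dtheta, theta, iterations=25):
    """Iteratively reduce sum_i (y_i - h(x_i; theta))^2 over a scalar theta.

    Each step linearizes h around the current theta and solves the
    resulting one-dimensional linear least squares problem.
    """
    for _ in range(iterations):
        residuals = [y - h(x, theta) for x, y in zip(xs, ys)]
        slopes = [dh_dtheta(x, theta) for x in xs]
        curvature = sum(s * s for s in slopes)
        if curvature == 0.0:
            break
        theta += sum(r * s for r, s in zip(residuals, slopes)) / curvature
    return theta


def h(x, theta):
    """Assumed model for this sketch: exponential growth with rate theta."""
    return math.exp(theta * x)


def dh_dtheta(x, theta):
    """Partial derivative of h with respect to theta."""
    return x * math.exp(theta * x)


# Noise-free data generated with theta = 0.7, so the minimizer is known.
xs = [0.0, 0.5, 1.0, 1.5, 2.0]
ys = [h(x, 0.7) for x in xs]
theta_hat = gauss_newton(xs, ys, h, dh_dtheta, theta=0.5)
print(theta_hat)
```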





Practical Considerations 


Regression is used widely in many contexts, from engineering to the social sci- 
ences. Yet its application often requires caution. We discuss here some important 
issues that need to be kept in mind, and the main ways in which regression may 
fail to produce reliable estimates. 


(a) Heteroskedasticity. The motivation of linear regression as ML estimation
in the presence of normal noise terms $W_i$ contains the assumption that
the variance of $W_i$ is the same for all $i$. Quite often, however, the variance
of $W_i$ varies substantially over the data pairs. For example, the variance of
$W_i$ may be strongly affected by the value of $x_i$. (For a concrete example,
suppose that $x_i$ is yearly income and $y_i$ is yearly consumption. It is natural
to expect that the variance of the consumption of rich individuals is much
larger than that of poorer individuals.) In this case, a few noise terms with
large variance may end up having an undue influence on the parameter
estimates. An appropriate remedy is to consider a weighted least squares
criterion of the form $\sum_{i=1}^n \alpha_i (y_i - \theta_0 - \theta_1 x_i)^2$, where the weights $\alpha_i$ are
smaller for those $i$ for which the variance of $W_i$ is large.

(b) Nonlinearity. Often a variable $x$ can explain the values of a variable
$y$, but the effect is nonlinear. As already discussed, a regression model
based on data pairs of the form $(h(x_i), y_i)$ may be more appropriate, with
a suitably chosen function $h$.

(c) Multicollinearity. Suppose that we use two explanatory variables $x$ and $z$
in a model that predicts another variable $y$. If the two variables $x$ and $z$ bear
a strong relation, the estimation procedure may be unable to distinguish
reliably the relative effects of each explanatory variable. For an extreme
example, suppose that the true relation is of the form $y = 2x + 1$ and that
the relation $z = 2x$ always holds. Then, the model $y = z + 1$ is equally valid,
and no estimation procedure can discriminate between these two models.

(d) Overfitting. Multiple regression with a large number of explanatory
variables and a correspondingly large number of parameters to be estimated
runs the danger of producing a model that fits the data well, but is
otherwise useless. For example, suppose that a linear model is valid but we
choose to fit 10 given data pairs with a polynomial of degree 9. The
resulting polynomial model will provide a perfect fit to the data, but will
nevertheless be incorrect. As a rule of thumb, there should be at least five
times (preferably ten times) more data points than there are parameters
to be estimated.

(e) Causality. The discovery of a linear relation between two variables $x$ and
$y$ should not be mistaken for a discovery of a causal relation. A tight fit
may be due to the fact that variable $x$ has a causal effect on $y$, but may be
equally due to a causal effect of $y$ on $x$. Alternatively, there may be some
external effect, described by yet another variable $z$, that affects both $x$ and
$y$ in similar ways. For a concrete example, let $x_i$ be the wealth of the first
born child and $y_i$ be the wealth of the second born child in the same family.
We expect $y_i$ to increase roughly linearly with $x_i$, but this can be traced to
the effect of a common family and background rather than a causal effect
of one child on the other.
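
The weighted least squares criterion mentioned under (a) has a closed form solution analogous to the unweighted one, with weighted averages in place of plain averages. The sketch below is illustrative only: the function name and data are assumptions, and the text does not prescribe how the weights $\alpha_i$ should be chosen (taking them inversely proportional to the noise variances is one common choice, assumed here).

```python
def weighted_least_squares(xs, ys, weights):
    """Minimize sum_i a_i * (y_i - theta0 - theta1 * x_i)^2.

    Returns (theta0_hat, theta1_hat), using weighted means of x and y.
    """
    total = sum(weights)
    x_bar = sum(a * x for a, x in zip(weights, xs)) / total
    y_bar = sum(a * y for a, y in zip(weights, ys)) / total
    sxy = sum(a * (x - x_bar) * (y - y_bar) for a, x, y in zip(weights, xs, ys))
    sxx = sum(a * (x - x_bar) ** 2 for a, x in zip(weights, xs))
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1


# Four points on the line y = 1 + 2x, plus one point far from it.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 30.0]

equal_weights = weighted_least_squares(xs, ys, [1.0] * 5)
down_weighted = weighted_least_squares(xs, ys, [1.0, 1.0, 1.0, 1.0, 1e-9])
print("equal weights:       ", equal_weights)
print("last point discounted:", down_weighted)
```

With equal weights the single distant point pulls the fitted line strongly; giving it a very small weight leaves a fit that is close to the line through the other four points.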


9.3 BINARY HYPOTHESIS TESTING 


In this section, we revisit the problem of choosing between two hypotheses, but
unlike the Bayesian formulation of Section 8.2, we will assume no prior
probabilities. We may view this as an inference problem where the parameter $\theta$ takes
just two values, but consistent with historical usage, we will forgo the $\theta$-notation
and denote the two hypotheses as $H_0$ and $H_1$. In traditional statistical language,
hypothesis $H_0$ is often called the null hypothesis and $H_1$ the alternative
hypothesis. This indicates that $H_0$ plays the role of a default model, to be proved
or disproved on the basis of available data.

The available observation is a vector $X = (X_1, \ldots, X_n)$ of random variables
whose distribution depends on the hypothesis. We will use the notation
$P(X \in A; H_j)$ to denote the probability that the observation $X$ belongs to a set $A$
when hypothesis $H_j$ is true. Note that consistent with the classical inference
framework, these are not conditional probabilities, because the true hypothesis
is not treated as a random variable. Similarly, we will use notation such as
$p_X(x; H_j)$ or $f_X(x; H_j)$ to denote the PMF or PDF, respectively, of the vector
$X$, under hypothesis $H_j$. We want to find a decision rule that maps the realized
values $x$ of the observation to one of the two hypotheses (see Fig. 9.7).













[Diagram panels: Observation Process; Decision Rule $g$.]

Figure 9.7: Classical inference framework for binary hypothesis testing.

Any decision rule can be represented by a partition of the set of all possible
values of the observation vector $X = (X_1, \ldots, X_n)$ into two subsets: a set $R$,
called the rejection region, and its complement, $R^c$, called the acceptance
region. Hypothesis $H_0$ is rejected (declared to be false) when the observed
data $X = (X_1, \ldots, X_n)$ happen to fall in the rejection region $R$ and is accepted
(declared to be true) otherwise; see Fig. 9.8. Thus, the choice of a decision rule
is equivalent to choosing the rejection region.


[Diagram panels: Space of Possible Observations $x$; Acceptance Region $R^c$
(Accept $H_0$); Rejection Region $R$ (Reject $H_0$). Labels: $H_0$ True, Type I
Error; $H_1$ True, Type II Error; $H_0$ True, No Error.]

Figure 9.8: Structure of a decision rule for binary hypothesis testing. It is
specified by a partition of the set of all possible observations into a set $R$ and
its complement $R^c$. The null hypothesis is rejected if the realized value of the
observation falls in the rejection region.


For a particular choice of the rejection region R, there are two possible 
types of errors: 


(a) Reject $H_0$ even though $H_0$ is true. This is called a Type I error, or a false
rejection, and happens with probability

$$\alpha(R) = P(X \in R; H_0).$$

(b) Accept $H_0$ even though $H_0$ is false. This is called a Type II error, or a
false acceptance, and happens with probability

$$\beta(R) = P(X \notin R; H_1).$$


To motivate a particular form of rejection region, we draw an analogy with
Bayesian hypothesis testing, involving two hypotheses $\Theta = \theta_0$ and $\Theta = \theta_1$, with
respective prior probabilities $p_\Theta(\theta_0)$ and $p_\Theta(\theta_1)$. Then, the overall probability
of error is minimized by using the MAP rule: given the observed value $x$ of $X$,
declare $\Theta = \theta_1$ to be true if

$$p_\Theta(\theta_0)\, p_{X|\Theta}(x \mid \theta_0) < p_\Theta(\theta_1)\, p_{X|\Theta}(x \mid \theta_1)$$

(assuming that $X$ is discrete).† This decision rule can be rewritten as follows:
define the likelihood ratio $L(x)$ by

$$L(x) = \frac{p_{X|\Theta}(x \mid \theta_1)}{p_{X|\Theta}(x \mid \theta_0)},$$

and declare $\Theta = \theta_1$ to be true if the realized value $x$ of the observation vector $X$
satisfies

$$L(x) > \xi,$$

where the critical value $\xi$ is

$$\xi = \frac{p_\Theta(\theta_0)}{p_\Theta(\theta_1)}.$$

If $X$ is continuous, the approach is the same, except that the likelihood ratio is
defined as a ratio of PDFs:

$$L(x) = \frac{f_{X|\Theta}(x \mid \theta_1)}{f_{X|\Theta}(x \mid \theta_0)}.$$

Motivated by the preceding form of the MAP rule, we are led to consider
rejection regions of the form

$$R = \{x \mid L(x) > \xi\},$$

where the likelihood ratio $L(x)$ is defined similar to the Bayesian case:‡

$$L(x) = \frac{p_X(x; H_1)}{p_X(x; H_0)}, \qquad \text{or} \qquad L(x) = \frac{f_X(x; H_1)}{f_X(x; H_0)}.$$

The critical value $\xi$ remains free to be chosen on the basis of other considerations.
The special case where $\xi = 1$ corresponds to the ML rule.


† In this paragraph, we use conditional probability notation since we are dealing
with a Bayesian framework.

‡ Note that we use $L(x)$ to denote the value of the likelihood ratio based on
the observed value $x$ of the random observation $X$. On the other hand, before the
experiment is carried out, the likelihood ratio is best viewed as a random variable, a
function of the observation $X$, in which case it is denoted by $L(X)$. The probability
distribution of $L(X)$ depends, of course, on which hypothesis is true.

Example 9.10. We have a six-sided die that we want to test for fairness, and we
formulate two hypotheses for the probabilities of the six faces:

$$H_0 \text{ (fair die)}: \quad p_X(x; H_0) = \frac{1}{6}, \qquad x = 1, \ldots, 6,$$

$$H_1 \text{ (loaded die)}: \quad p_X(x; H_1) = \begin{cases} \dfrac{1}{4}, & \text{if } x = 1, 2, \\[1ex] \dfrac{1}{8}, & \text{if } x = 3, 4, 5, 6. \end{cases}$$

The likelihood ratio for a single roll $x$ of the die is

$$L(x) = \begin{cases} \dfrac{1/4}{1/6} = \dfrac{3}{2}, & \text{if } x = 1, 2, \\[1ex] \dfrac{1/8}{1/6} = \dfrac{3}{4}, & \text{if } x = 3, 4, 5, 6. \end{cases}$$

Since the likelihood ratio takes only two distinct values, there are three possibilities
to consider for the critical value $\xi$, with three corresponding rejection regions:

$\xi < 3/4$: reject $H_0$ for all $x$;
$3/4 < \xi < 3/2$: accept $H_0$ if $x = 3, 4, 5, 6$; reject $H_0$ if $x = 1, 2$;
$3/2 < \xi$: accept $H_0$ for all $x$.

Intuitively, a roll of 1 or 2 provides evidence that favors $H_1$, and we tend to reject
$H_0$. On the other hand, if we set the critical value too high ($\xi > 3/2$), we never
reject $H_0$. In fact, for a single roll of the die, the test makes sense only in the case
$3/4 < \xi < 3/2$, since for other values of $\xi$, the decision does not depend on the
observation.

The error probabilities can be calculated from the problem data for each
critical value. In particular, the probability of false rejection $P(\text{Reject } H_0; H_0)$ is

$$\alpha(\xi) = \begin{cases} 1, & \text{if } \xi < \dfrac{3}{4}, \\[1ex] P(X = 1, 2; H_0) = \dfrac{1}{3}, & \text{if } \dfrac{3}{4} < \xi < \dfrac{3}{2}, \\[1ex] 0, & \text{if } \dfrac{3}{2} < \xi, \end{cases}$$

and the probability of false acceptance $P(\text{Accept } H_0; H_1)$ is

$$\beta(\xi) = \begin{cases} 0, & \text{if } \xi < \dfrac{3}{4}, \\[1ex] P(X = 3, 4, 5, 6; H_1) = \dfrac{1}{2}, & \text{if } \dfrac{3}{4} < \xi < \dfrac{3}{2}, \\[1ex] 1, & \text{if } \dfrac{3}{2} < \xi. \end{cases}$$
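
The error probabilities in Example 9.10 can be reproduced by enumerating the six faces. The following sketch uses exact rational arithmetic; the names `likelihood_ratio` and `error_probabilities` are illustrative and not from the text.

```python
from fractions import Fraction

FACES = range(1, 7)
P_H0 = {x: Fraction(1, 6) for x in FACES}                            # fair die
P_H1 = {x: Fraction(1, 4) if x <= 2 else Fraction(1, 8) for x in FACES}  # loaded die


def likelihood_ratio(x):
    """L(x) = p_X(x; H1) / p_X(x; H0) for a single roll x."""
    return P_H1[x] / P_H0[x]


def error_probabilities(xi):
    """Return (alpha, beta) for the rejection region {x : L(x) > xi}."""
    region = {x for x in FACES if likelihood_ratio(x) > xi}
    alpha = sum((P_H0[x] for x in region), Fraction(0))
    beta = sum((P_H1[x] for x in FACES if x not in region), Fraction(0))
    return alpha, beta


# One critical value from each of the three ranges discussed in the example.
for xi in (Fraction(1, 2), Fraction(1), Fraction(2)):
    print(xi, error_probabilities(xi))
```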


Note that choosing $\xi$ trades off the probabilities of the two types of errors, as
illustrated by the preceding example. Indeed, as $\xi$ increases, the rejection region
becomes smaller. As a result, the false rejection probability $\alpha(R)$ decreases,
while the false acceptance probability $\beta(R)$ increases (see Fig. 9.9). Because of
this tradeoff, there is no single best way of choosing the critical value. The most
popular approach is as follows.

Figure 9.9: Error probabilities in a likelihood ratio test. As the critical value $\xi$
increases, the rejection region becomes smaller. As a result, the false rejection
probability $\alpha$ decreases, while the false acceptance probability $\beta$ increases. When
the dependence of $\alpha$ on $\xi$ is continuous and strictly decreasing, there is a unique
value of $\xi$ that corresponds to a given $\alpha$ (see the figure on the left). However, the
dependence of $\alpha$ on $\xi$ may not be continuous, e.g., if the likelihood ratio $L(x)$ can
only take finitely many different values (see the figure on the right).














Likelihood Ratio Test (LRT) 
• Start with a target value $\alpha$ for the false rejection probability.

• Choose a value for $\xi$ such that the false rejection probability is equal
to $\alpha$:

$$P(L(X) > \xi; H_0) = \alpha.$$

• Once the value $x$ of $X$ is observed, reject $H_0$ if $L(x) > \xi$.





Typical choices for $\alpha$ are $\alpha = 0.1$, $\alpha = 0.05$, or $\alpha = 0.01$, depending on the
degree of undesirability of false rejection. Note that to be able to apply the LRT
to a given problem, the following are required:

(a) We must be able to compute $L(x)$ for any given observation value $x$, so
that we can compare it with the critical value $\xi$. Fortunately, this is always
the case when the underlying PMFs or PDFs are given in closed form.

(b) We must either have a closed form expression for the distribution of $L(X)$
[or of a related random variable such as $\log L(X)$] or we must be able to
approximate it analytically, computationally, or through simulation. This
is needed to determine the critical value $\xi$ that corresponds to a given false
rejection probability $\alpha$.

Figure 9.10: Rejection and acceptance regions in Example 9.11, and
corresponding false rejection and false acceptance probabilities.


Example 9.11. A surveillance camera periodically checks a certain area and
records a signal $X = W$ or $X = 1 + W$ depending on whether an intruder is
absent or present (hypotheses $H_0$ or $H_1$, respectively). We assume that $W$ is a
normal random variable with mean 0 and known variance $v$. Since

$$f_X(x; H_0) = \frac{1}{\sqrt{2\pi v}} \exp\Big\{-\frac{x^2}{2v}\Big\}, \qquad f_X(x; H_1) = \frac{1}{\sqrt{2\pi v}} \exp\Big\{-\frac{(x - 1)^2}{2v}\Big\},$$

the likelihood ratio is

$$L(x) = \frac{f_X(x; H_1)}{f_X(x; H_0)} = \exp\Big\{\frac{x^2 - (x - 1)^2}{2v}\Big\} = \exp\Big\{\frac{2x - 1}{2v}\Big\}.$$

For a given critical value $\xi$, the LRT rejects $H_0$ if $L(x) > \xi$, or equivalently, after a
straightforward calculation, if

$$x > v \log \xi + \frac{1}{2}.$$

Thus, the rejection region is of the form

$$R = \{x \mid x > \gamma\}$$

for some $\gamma$, which corresponds to $\xi$ via the relation

$$\gamma = v \log \xi + \frac{1}{2};$$

see Fig. 9.10. We set a target value $\alpha$ for the false rejection probability, and we
proceed to determine $\gamma$ from the relation

$$\alpha = P(X > \gamma; H_0) = P(W > \gamma),$$

and the normal tables. For example, if $\alpha = 0.025$, then $\gamma = 1.96 \sqrt{v}$. We may also
calculate the false acceptance probability,

$$\beta = P(X \le \gamma; H_1) = P(1 + W \le \gamma) = P(W \le \gamma - 1),$$

by using again the normal tables.
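
The table lookups in Example 9.11 can be replaced by calls to the standard normal CDF and its inverse. The sketch below uses Python's `statistics.NormalDist`; the function name `surveillance_test` and the choice of returning the triple $(\gamma, \beta, \xi)$ are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def surveillance_test(alpha, v):
    """Threshold gamma, false acceptance probability beta, and critical value xi.

    Under H0, X = W with W normal(0, v); under H1, X = 1 + W.
    """
    std_normal = NormalDist()
    sigma = math.sqrt(v)
    # alpha = P(W > gamma), so gamma = sigma * (1 - alpha quantile of Z).
    gamma = sigma * std_normal.inv_cdf(1.0 - alpha)
    # beta = P(W <= gamma - 1).
    beta = std_normal.cdf((gamma - 1.0) / sigma)
    # Invert gamma = v * log(xi) + 1/2.
    xi = math.exp((2.0 * gamma - 1.0) / (2.0 * v))
    return gamma, beta, xi


gamma, beta, xi = surveillance_test(alpha=0.025, v=1.0)
print(gamma, beta, xi)
```

For $\alpha = 0.025$ the computed threshold should agree with the value $1.96 \sqrt{v}$ quoted in the example, up to table rounding.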
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When L(X) is a continuous random variable, as in the preceding example, 
the probability $P(L(X) > \xi; H_0)$ moves continuously from 1 to 0 as $\xi$ increases.
Thus, we can find a value of $\xi$ for which the requirement $P(L(X) > \xi; H_0) = \alpha$
is satisfied. If, however, $L(X)$ is a discrete random variable, it may be impossible
to satisfy the equality $P(L(X) > \xi; H_0) = \alpha$ exactly, no matter how $\xi$ is chosen;
cf. Example 9.10. In such cases, there are several possibilities:

(a) Strive for approximate equality.

(b) Choose the smallest value of $\xi$ that satisfies $P(L(X) > \xi; H_0) \le \alpha$.

(c) Use an exogenous source of randomness to choose between two alternative
candidate critical values. This variant (known as a "randomized likelihood
ratio test") is of some theoretical interest. However, it is not sufficiently
important in practice to deserve further discussion in this book.


We have motivated so far the use of a LRT through an analogy with 
Bayesian inference. However, we will now provide a stronger justification: for a 
given false rejection probability, the LRT offers the smallest possible false accep- 
tance probability. 


Neyman-Pearson Lemma 


Consider a particular choice of $\xi$ in the LRT, which results in error
probabilities

$$P(L(X) > \xi; H_0) = \alpha, \qquad P(L(X) \le \xi; H_1) = \beta.$$

Suppose that some other test, with rejection region $R$, achieves a smaller or
equal false rejection probability:

$$P(X \in R; H_0) \le \alpha.$$

Then,

$$P(X \notin R; H_1) \ge \beta,$$

with strict inequality $P(X \notin R; H_1) > \beta$ when $P(X \in R; H_0) < \alpha$.





For a justification of the Neyman-Pearson Lemma, consider a hypothetical
Bayesian decision problem where the prior probabilities of $H_0$ and $H_1$ satisfy

$$\frac{p_\Theta(\theta_0)}{p_\Theta(\theta_1)} = \xi,$$

so that

$$p_\Theta(\theta_0) = \frac{\xi}{1 + \xi}, \qquad p_\Theta(\theta_1) = \frac{1}{1 + \xi}.$$

Then, the threshold used by the MAP rule is equal to $\xi$, as discussed in the
beginning of this section, and the MAP rule is identical to the LRT rule. The
probability of error with the MAP rule is

$$e_{\text{MAP}} = \frac{\xi}{1 + \xi}\, \alpha + \frac{1}{1 + \xi}\, \beta,$$

and from Section 8.2, we know that it is smaller than or equal to the probability
of error of any other Bayesian decision rule. This implies that for any choice of
rejection region $R$, we have

$$e_{\text{MAP}} \le \frac{\xi}{1 + \xi}\, P(X \in R; H_0) + \frac{1}{1 + \xi}\, P(X \notin R; H_1).$$

Comparing the preceding two relations and multiplying through by $1 + \xi$, we
obtain $\xi \big(\alpha - P(X \in R; H_0)\big) \le P(X \notin R; H_1) - \beta$. Hence, if $P(X \in R; H_0) \le \alpha$, we
must have $P(X \notin R; H_1) \ge \beta$, and if $P(X \in R; H_0) < \alpha$, we must have
$P(X \notin R; H_1) > \beta$, which is the conclusion of the Neyman-Pearson Lemma.

The Neyman-Pearson Lemma can be interpreted geometrically as shown
in Fig. 9.11. We illustrate the lemma with a few examples.









[Diagram axes: False Rejection Probability; False Acceptance Probability.
Marked: the set $\mathcal{E}$ of pairs $(\alpha(R), \beta(R))$ and its Efficient Frontier.]

Figure 9.11: Interpretation of the Neyman-Pearson Lemma. Consider the set $\mathcal{E}$
of all error probability pairs $(\alpha(R), \beta(R))$, as $R$ ranges over all possible rejection
regions (subsets of the observation space). The efficient frontier of $\mathcal{E}$ is the
set of all $(\alpha(R), \beta(R)) \in \mathcal{E}$ such that there is no $(\alpha, \beta) \in \mathcal{E}$ with $\alpha \le \alpha(R)$ and
$\beta < \beta(R)$, or $\alpha < \alpha(R)$ and $\beta \le \beta(R)$. The Neyman-Pearson Lemma states that
all pairs $(\alpha(\xi), \beta(\xi))$ corresponding to LRTs lie on the efficient frontier.


Example 9.12. Consider Example 9.10, where we roll a six-sided die once and
test it for fairness. We consider the set $\mathcal{E}$ of all error probability pairs $(\alpha(R), \beta(R))$
as $R$ ranges over all possible rejection regions (all subsets of the observation space
$\{1, \ldots, 6\}$). The set $\mathcal{E}$ is shown in Fig. 9.12 and it can be seen that the error
probability pairs $(1, 0)$, $(1/3, 1/2)$, and $(0, 1)$ associated with the LRTs have the
property given by the Neyman-Pearson Lemma (i.e., lie on the efficient frontier, in
the terminology of Fig. 9.11).

Figure 9.12: Set of pairs $(\alpha(R), \beta(R))$ as the rejection region $R$ ranges over all
subsets of the observation space $\{1, \ldots, 6\}$ in Examples 9.10 and 9.12. The pairs
$(1, 0)$, $(0, 1)$, and $(1/3, 1/2)$ are the ones that correspond to LRTs.
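
Since the observation space in Example 9.12 has only $2^6 = 64$ subsets, the claim can be verified by brute force. The following sketch enumerates every rejection region and checks the three LRT pairs against the efficient-frontier definition given with Fig. 9.11. The identifiers (`error_pair`, `is_efficient`, and so on) are illustrative names, not from the text.

```python
from fractions import Fraction
from itertools import combinations

FACES = range(1, 7)
P_H0 = {x: Fraction(1, 6) for x in FACES}
P_H1 = {x: Fraction(1, 4) if x <= 2 else Fraction(1, 8) for x in FACES}


def error_pair(region):
    """(alpha(R), beta(R)) for a rejection region R, given as a set of faces."""
    alpha = sum((P_H0[x] for x in region), Fraction(0))
    beta = sum((P_H1[x] for x in FACES if x not in region), Fraction(0))
    return alpha, beta


# All distinct error pairs, as R ranges over every subset of {1, ..., 6}.
ALL_PAIRS = {
    error_pair(frozenset(subset))
    for size in range(7)
    for subset in combinations(FACES, size)
}


def is_efficient(pair, pairs):
    """True if no other pair is at least as good in both errors and better in one."""
    a, b = pair
    return not any(
        (a2 <= a and b2 < b) or (a2 < a and b2 <= b) for a2, b2 in pairs
    )


LRT_PAIRS = [
    (Fraction(1), Fraction(0)),
    (Fraction(1, 3), Fraction(1, 2)),
    (Fraction(0), Fraction(1)),
]
for pair in LRT_PAIRS:
    print(pair, pair in ALL_PAIRS, is_efficient(pair, ALL_PAIRS))
```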


Example 9.13. Comparison of Different Rejection Regions. We observe
two i.i.d. normal random variables $X_1$ and $X_2$, with unit variance. Under $H_0$ their
common mean is 0; under $H_1$, their common mean is 2. We fix the false rejection
probability to $\alpha = 0.05$.

We first derive the form of the LRT, and then calculate the resulting value of
$\beta$. The likelihood ratio is of the form

$$L(x) = \frac{\dfrac{1}{2\pi} \exp\big\{-\big((x_1 - 2)^2 + (x_2 - 2)^2\big)/2\big\}}{\dfrac{1}{2\pi} \exp\big\{-(x_1^2 + x_2^2)/2\big\}} = \exp\{2(x_1 + x_2) - 4\}.$$

Comparing $L(x)$ to a critical value $\xi$ is equivalent to comparing $x_1 + x_2$ to
$\gamma = (4 + \log \xi)/2$. Thus, under the LRT, we decide in favor of $H_1$ if $x_1 + x_2 > \gamma$, for
some particular choice of $\gamma$. This determines the shape of the rejection region.

To determine the exact form of the rejection region, we need to find $\gamma$ so that
the false rejection probability $P(X_1 + X_2 > \gamma; H_0)$ is equal to 0.05. We note that
under $H_0$, $Z = (X_1 + X_2)/\sqrt{2}$ is a standard normal random variable. We have

$$0.05 = P(X_1 + X_2 > \gamma; H_0) = P\Big(\frac{X_1 + X_2}{\sqrt{2}} > \frac{\gamma}{\sqrt{2}}; H_0\Big) = P\Big(Z > \frac{\gamma}{\sqrt{2}}\Big).$$

From the normal tables, we obtain $P(Z > 1.645) = 0.05$, so we choose

$$\gamma = 1.645 \cdot \sqrt{2} = 2.33,$$

resulting in the rejection region

$$R = \{(x_1, x_2) \mid x_1 + x_2 > 2.33\}.$$

To evaluate the performance of this test, we calculate the resulting false
acceptance probability. Note that under $H_1$, $X_1 + X_2$ is normal with mean equal to
4 and variance equal to 2, so that $Z = (X_1 + X_2 - 4)/\sqrt{2}$ is a standard normal
random variable. Thus, using the normal tables, the false acceptance probability is
given by

$$\begin{aligned}
\beta(R) &= P(X_1 + X_2 \le 2.33; H_1) \\
&= P\Big(\frac{X_1 + X_2 - 4}{\sqrt{2}} \le \frac{2.33 - 4}{\sqrt{2}}; H_1\Big) \\
&= P(Z \le -1.18) \\
&= P(Z \ge 1.18) \\
&= 1 - P(Z \le 1.18) \\
&= 1 - 0.88 \\
&= 0.12.
\end{aligned}$$

We now compare the performance of the LRT with that resulting from a
different rejection region $R'$. For example, let us consider a rejection region of the
form

$$R' = \{(x_1, x_2) \mid \max\{x_1, x_2\} > \zeta\},$$

where $\zeta$ is chosen so that the false rejection probability is again 0.05. To determine
the value of $\zeta$, we write

$$\begin{aligned}
0.05 &= P(\max\{X_1, X_2\} > \zeta; H_0) \\
&= 1 - P(\max\{X_1, X_2\} \le \zeta; H_0) \\
&= 1 - P(X_1 \le \zeta; H_0)\, P(X_2 \le \zeta; H_0) \\
&= 1 - \big(P(Z \le \zeta; H_0)\big)^2,
\end{aligned}$$

where $Z$ is a standard normal. This yields $P(Z \le \zeta; H_0) = \sqrt{1 - 0.05} \approx 0.975$.
Using the normal tables, we conclude that $\zeta = 1.96$.

Let us now calculate the resulting false acceptance probability. Letting $Z$ be
again a standard normal, we have

$$\begin{aligned}
\beta(R') &= P(\max\{X_1, X_2\} \le 1.96; H_1) \\
&= \big(P(X_1 \le 1.96; H_1)\big)^2 \\
&= \big(P(X_1 - 2 \le -0.04; H_1)\big)^2 \\
&= \big(P(Z \le -0.04)\big)^2 \\
&= (0.49)^2 \\
&= 0.24.
\end{aligned}$$

We see that the false acceptance probability $\beta(R) = 0.12$ of the LRT is much better
than the false acceptance probability $\beta(R') = 0.24$ of the alternative test.
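
The two tests of Example 9.13 can be compared without tables by using the standard normal CDF and quantile function. This is a minimal sketch with illustrative variable names; because it avoids the rounding of table entries, its outputs may differ slightly in the last digit from the values quoted in the example.

```python
import math
from statistics import NormalDist

Z = NormalDist()          # standard normal
ALPHA = 0.05              # target false rejection probability
MEAN_H1 = 2.0             # common mean of X1 and X2 under H1

# LRT: reject H0 if x1 + x2 > gamma, where (X1 + X2)/sqrt(2) is standard normal under H0.
gamma = math.sqrt(2.0) * Z.inv_cdf(1.0 - ALPHA)
# Under H1, X1 + X2 is normal with mean 4 and variance 2.
beta_lrt = Z.cdf((gamma - 2.0 * MEAN_H1) / math.sqrt(2.0))

# Alternative test: reject H0 if max{x1, x2} > zeta.
zeta = Z.inv_cdf(math.sqrt(1.0 - ALPHA))
# Under H1, each X_i - 2 is standard normal, and the two are independent.
beta_max = Z.cdf(zeta - MEAN_H1) ** 2

print("LRT:      gamma =", gamma, " beta =", beta_lrt)
print("max test: zeta  =", zeta, " beta =", beta_max)
```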


Example 9.14. A Discrete Example. Consider $n = 25$ independent tosses of
a coin. Under hypothesis $H_0$ (respectively, $H_1$), the probability of a head at each
toss is equal to $\theta_0 = 1/2$ (respectively, $\theta_1 = 2/3$). Let $X$ be the number of heads
observed. If we set the false rejection probability to 0.1, what is the rejection region
associated with the LRT?

We observe that when $X = k$, the likelihood ratio is of the form

$$L(k) = \frac{\binom{n}{k} \theta_1^k (1 - \theta_1)^{n-k}}{\binom{n}{k} \theta_0^k (1 - \theta_0)^{n-k}} = \Big(\frac{\theta_1}{\theta_0} \cdot \frac{1 - \theta_0}{1 - \theta_1}\Big)^k \cdot \Big(\frac{1 - \theta_1}{1 - \theta_0}\Big)^n = 2^k \Big(\frac{2}{3}\Big)^{25}.$$

Note that $L(k)$ is a monotonically increasing function of $k$. Thus, the rejection
condition $L(k) > \xi$ is equivalent to a condition $k > \gamma$, for a suitable value of $\gamma$. We
conclude that the LRT is of the form

reject $H_0$ if $X > \gamma$.

To guarantee the requirement on the false rejection probability, we need to
find the smallest possible value of $\gamma$ for which $P(X > \gamma; H_0) \le 0.1$, or

$$0.1 \ge \sum_{i=\gamma+1}^{25} \binom{25}{i} 2^{-25}.$$

By evaluating numerically the right-hand side above for different choices of $\gamma$, we
find that the required value is $\gamma = 16$.

An alternative method for choosing $\gamma$ involves an approximation based on the
central limit theorem. Under $H_0$,

$$Z = \frac{X - n\theta_0}{\sqrt{n \theta_0 (1 - \theta_0)}} = \frac{X - 12.5}{\sqrt{25/4}}$$

is approximately a standard normal random variable. Therefore, we need

$$0.1 \approx P(X > \gamma; H_0) = P\Big(\frac{X - 12.5}{\sqrt{25/4}} > \frac{\gamma - 12.5}{\sqrt{25/4}}; H_0\Big) = P\Big(Z > \frac{2\gamma}{5} - 5\Big).$$

From the normal tables, we have $\Phi(1.28) = 0.9$, and therefore, we should choose $\gamma$
so that $(2\gamma/5) - 5 = 1.28$, or $\gamma = 15.7$. Since $X$ is integer-valued, we find that the
LRT should reject $H_0$ whenever $X > 15$.
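
Both ways of choosing $\gamma$ in Example 9.14 are easy to carry out numerically. The sketch below computes the exact binomial tail under $H_0$ and the central limit theorem approximation; the function names are illustrative, and `math.comb` requires Python 3.8 or later.

```python
from math import comb, sqrt
from statistics import NormalDist


def tail_probability(n, gamma):
    """P(X > gamma; H0) when X is binomial with parameters n and 1/2."""
    return sum(comb(n, i) for i in range(gamma + 1, n + 1)) / 2 ** n


def exact_threshold(n, alpha):
    """Smallest integer gamma with P(X > gamma; H0) <= alpha."""
    for gamma in range(n + 1):
        if tail_probability(n, gamma) <= alpha:
            return gamma


def clt_threshold(n, alpha):
    """Normal approximation: solve P(Z > (gamma - n/2) / sqrt(n/4)) = alpha."""
    return n / 2 + sqrt(n / 4) * NormalDist().inv_cdf(1 - alpha)


print(exact_threshold(25, 0.1))
print(clt_threshold(25, 0.1))
print(tail_probability(25, 15), tail_probability(25, 16))
```

Note that the exact calculation and the approximation lead to slightly different rejection regions, which is a consequence of the discreteness of $X$.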


9.4 SIGNIFICANCE TESTING 


Hypothesis testing problems encountered in realistic settings do not always
involve two well-specified alternatives, so the methodology in the preceding section
cannot be applied. The purpose of this section is to introduce an approach to
this more general class of problems. We caution, however, that a unique or
universal methodology is not available, and that there is a significant element of
judgment and art that comes into play.

For some motivation, consider problems such as the following:

(i) A coin is tossed repeatedly and independently. Is the coin fair?

(ii) A die is tossed repeatedly and independently. Is the die fair?

(iii) We observe a sequence of i.i.d. normal random variables $X_1, \ldots, X_n$. Are
they standard normal?

(iv) Two different drug treatments are delivered to two different groups of
patients with the same disease. Is the first treatment more effective than the
second?

(v) On the basis of historical data (say, based on the last year), is the daily
change of the Dow Jones Industrial Average normally distributed?

(vi) On the basis of several sample pairs $(x_i, y_i)$ of two random variables $X$ and
$Y$, can we determine whether the two random variables are independent?

In all of the above cases, we are dealing with a phenomenon that involves
uncertainty, presumably governed by a probabilistic model. We have a default
hypothesis, usually called the null hypothesis, denoted by $H_0$, and we wish to
determine on the basis of the observations $X = (X_1, \ldots, X_n)$ whether the null
hypothesis should be rejected or not.

In order to avoid obscuring the key ideas, we will mostly restrict the scope
of our discussion to situations with the following characteristics.

(a) Parametric models: We assume that the observations $X_1, \ldots, X_n$ have
a distribution governed by a joint PMF (discrete case) or a joint PDF
(continuous case), which is completely determined by an unknown parameter
$\theta$ (scalar or vector), belonging to a given set $\mathcal{M}$ of possible parameters.

(b) Simple null hypothesis: The null hypothesis asserts that the true value
of $\theta$ is equal to a given element $\theta_0$ of $\mathcal{M}$.

(c) Alternative hypothesis: The alternative hypothesis, denoted by $H_1$, is
just the statement that $H_0$ is not true, i.e., that $\theta \ne \theta_0$.

In reference to the motivating examples introduced earlier, notice that
examples (i)-(ii) satisfy conditions (a)-(c) above. On the other hand, in examples
(iv)-(vi), the null hypothesis is not simple, violating condition (b).


The General Approach 


We introduce the general approach through a concrete example. We then sum- 
marize and comment on the various steps involved. Finally, we consider a few 
more examples that conform to the general approach. 


Example 9.15. Is My Coin Fair? A coin is tossed independently $n = 1000$
times. Let $\theta$ be the unknown probability of heads at each toss. The set of all
possible parameters is $\mathcal{M} = [0, 1]$. The null hypothesis $H_0$ ("the coin is fair") is of
the form $\theta = 1/2$. The alternative hypothesis is that $\theta \neq 1/2$.

The observed data is a sequence $X_1, \ldots, X_n$, where $X_i$ equals 1 or 0,
depending on whether the $i$th toss resulted in heads or tails. We choose to address
the problem by considering the value of $S = X_1 + \cdots + X_n$, the number of heads
observed, and using a decision rule of the form:


    reject $H_0$ if $\left| S - \frac{n}{2} \right| > \xi$,


where $\xi$ is a suitable critical value, to be determined. We have so far defined the
shape of the rejection region $R$ (the set of data vectors that lead to rejection of
the null hypothesis). We finally choose the critical value $\xi$ so that the probability
of false rejection is equal to a given value $\alpha$:


    $$P(\text{reject } H_0; H_0) = \alpha.$$


Typically, $\alpha$, called the significance level, is a small number; in this example, we
use $\alpha = 0.05$.

The discussion so far involved only a sequence of intuitive choices. Some
probabilistic calculations are now needed to determine the critical value $\xi$. Under
the null hypothesis, the random variable $S$ is binomial with parameters $n = 1000$
and $p = 1/2$. Using the normal approximation to the binomial and the normal
tables, we find that an appropriate choice is $\xi = 31$. If, for example, the observed
value of $S$ turns out to be $s = 472$, we have


    $$|s - 500| = |472 - 500| = 28 \leq 31,$$

and the hypothesis $H_0$ is not rejected at the 5% significance level.
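
The critical value in this example can be reproduced numerically. The following is a minimal sketch, assuming the plain normal approximation (no continuity correction) and using Python's standard-library `statistics.NormalDist` in place of the normal tables; the variable names are ours and are chosen only for illustration.

```python
from statistics import NormalDist

n = 1000          # number of tosses
alpha = 0.05      # significance level
mean = n / 2                  # mean of S under H0 (theta = 1/2)
std = (n * 0.5 * 0.5) ** 0.5  # standard deviation of S under H0

# Two-sided rule: P(|S - n/2| > xi; H0) = alpha, so xi = z * std,
# where z is the (1 - alpha/2) quantile of the standard normal.
z = NormalDist().inv_cdf(1 - alpha / 2)
xi = z * std

s = 472                       # observed number of heads
reject = abs(s - mean) > xi   # False means "H0 is not rejected"

print(f"z = {z:.2f}, xi = {xi:.1f}, reject H0: {reject}")
```

Rounding the computed threshold to the nearest integer gives the value 31 used in the text.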


Our use of the language "not rejected" as opposed to "accepted," at the
end of the preceding example is deliberate. We do not have any firm grounds to
assert that $\theta$ equals 1/2, as opposed to, say, 0.51. We can only assert that the
observed value of $S$ does not provide strong evidence against hypothesis $H_0$.

We can now summarize and generalize the essence of the preceding example,
to obtain a generic methodology.





Significance Testing Methodology 


A statistical test of a hypothesis "$H_0 : \theta = \theta^*$" is to be performed, based
on the observations $X_1, \ldots, X_n$.


• The following steps are carried out before the data are observed.


(a) Choose a statistic $S$, that is, a scalar random variable that
will summarize the data to be obtained. Mathematically, this
involves the choice of a function $h : \Re^n \to \Re$, resulting in the
statistic $S = h(X_1, \ldots, X_n)$.


(b) Determine the shape of the rejection region by specifying
the set of values of $S$ for which $H_0$ will be rejected as a function
of a yet undetermined critical value $\xi$.


(c) Choose the significance level, i.e., the desired probability $\alpha$ of
a false rejection of $H_0$.


(d) Choose the critical value $\xi$ so that the probability of false
rejection is equal (or approximately equal) to $\alpha$. At this point, the
rejection region is completely determined.


• Once the values $x_1, \ldots, x_n$ of $X_1, \ldots, X_n$ are observed:
(i) Calculate the value $s = h(x_1, \ldots, x_n)$ of the statistic $S$.
(ii) Reject the hypothesis $H_0$ if $s$ belongs to the rejection region.





Let us add some comments and interpretation for the various elements of
the above methodology.


(i) There is no universal method for choosing the "right" statistic $S$. In some
cases, as in Example 9.15, the choice is natural and can also be justified
mathematically. In other cases, a meaningful choice of $S$ involves a certain
generalization of the likelihood ratio, to be touched upon later in this
section. Finally, in many situations, the primary consideration is whether $S$
is simple enough to enable the calculations needed in step (d) of the above
methodology.


(ii) The set of values of $S$ under which $H_0$ is not rejected is usually an interval
surrounding the peak of the distribution of $S$ under $H_0$ (see Fig. 9.13). In
the limit of a large sample size $n$, the central limit theorem often applies
to $S$, and the symmetry of the normal distribution suggests an interval
which is symmetric around the mean value of $S$. Similarly, the symmetry
of the rejection region in Example 9.15 is well-motivated by the fact that,
under $H_0$, the distribution of $S$ (binomial with parameter 1/2) is symmetric
around its mean. In other cases, however, nonsymmetric rejection regions
are more appropriate. For example, if we are certain that the coin in
Example 9.15 satisfies $\theta \geq 1/2$, a one-sided rejection region is natural:


    reject $H_0$ if $S - \frac{n}{2} > \xi$.


(iii) Typical choices for the false rejection probability $\alpha$ range between $\alpha = 0.10$
and $\alpha = 0.01$. Of course, one wishes false rejections to be rare, but in
light of the tradeoff discussed in the context of simple binary hypotheses, a
smaller value of $\alpha$ makes it more difficult to reject a false hypothesis, i.e.,
increases the probability of false acceptance.


Figure 9.13: Two-sided and one-sided rejection regions for significance testing,
based on a statistic $S$ with mean $s$ under the null hypothesis. The significance
level is the probability of false rejection, i.e., the probability, under $H_0$, that the
statistic $S$ takes a value within the rejection region.


(iv) Step (d) is the only place where probabilistic calculations are used. It
requires that the distribution of $L(X)$ [or of a related random variable such
as $\log L(X)$] under the hypothesis $H_0$ be available, possibly approximately.
In special cases, this is straightforward or involves an exercise in derived
distributions. However, except for relatively simple situations, the
distribution of $S$ cannot be found in closed form. If $n$ is large, one can often use
well-justified approximations, e.g., based on the central limit theorem. On
the other hand, if $n$ is moderate, useful approximations may be difficult to
obtain. For this reason, the choice of the statistic $S$ is sometimes guided
by the desire to obtain a tractable expression or approximation for the
distribution of $S$. Alternatively, the distribution of $S$ may be estimated
by simulation, e.g., by generating many independent samples of $X$, and
by using the resulting samples of $L(X)$ to build a histogram/estimated
distribution.
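
The simulation idea mentioned in comment (iv) can be sketched as follows for the coin of Example 9.15. This is only an illustration under our own assumptions: the function names, the number of trials, and the seed are hypothetical choices, and the statistic simulated is the number of heads $S$ under $H_0$.

```python
import random

def simulate_heads(n, rng):
    """One sample of S under H0: the number of heads in n fair tosses."""
    # n random bits are n independent fair coin tosses; count the ones.
    return bin(rng.getrandbits(n)).count("1")

def false_rejection_rate(n, xi, trials, seed=0):
    """Estimate P(|S - n/2| > xi; H0) from simulated samples of S."""
    rng = random.Random(seed)
    rejections = sum(
        abs(simulate_heads(n, rng) - n / 2) > xi for _ in range(trials)
    )
    return rejections / trials

rate = false_rejection_rate(n=1000, xi=31, trials=20000)
print(f"estimated false rejection probability: {rate:.3f}")
```

With a critical value of 31, the estimate should come out close to the target significance level of 0.05, up to simulation noise and the discreteness of the binomial distribution.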


Given the value of $\alpha$, if the hypothesis $H_0$ ends up being rejected, one says
that $H_0$ is rejected at the $\alpha$ significance level. This statement needs to be
interpreted properly. It does not mean that the probability of $H_0$ being true is
less than $\alpha$. Instead, it means that when this particular methodology is used,
we will have false rejections a fraction $\alpha$ of the time. Rejecting a hypothesis at
the 1% significance level means that the observed data are highly unusual under
the model associated with $H_0$; such data would arise only 1% of the time, and
thus provide strong evidence that $H_0$ may be false.

Quite often, statisticians skip steps (c) and (d) in the above described
methodology. Instead, once they calculate the realized value $s$ of $S$, they
determine and report an associated p-value defined by

    p-value $= \min\{\alpha \mid H_0$ would be rejected at the $\alpha$ significance level$\}$.

Equivalently, the p-value is the value of $\alpha$ for which $s$ would be exactly at the
threshold between rejection and non-rejection. Thus, for example, the null
hypothesis would be rejected at the 5% significance level if and only if the p-value
is smaller than 0.05.
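
As an illustration, the p-value for the coin data of Example 9.15 ($s = 472$ heads in $n = 1000$ tosses) can be approximated as follows. This is a sketch that assumes the normal approximation to the binomial without a continuity correction; the function name is hypothetical.

```python
from statistics import NormalDist

def two_sided_p_value(s, n):
    """Approximate p-value for the rule 'reject H0 if |S - n/2| > xi'."""
    std = (n / 4) ** 0.5                  # std of S under H0 (theta = 1/2)
    z = abs(s - n / 2) / std              # standardized distance from the mean
    # Smallest alpha whose threshold is reached by the observed distance.
    return 2 * (1 - NormalDist().cdf(z))

p = two_sided_p_value(s=472, n=1000)
print(f"p-value = {p:.3f}")
```

The resulting value lies above 0.05, which agrees with the earlier conclusion that $H_0$ is not rejected at the 5% significance level.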

A few examples illustrate the main ideas.


Example 9.16. Is the Mean of a Normal Equal to Zero? Here we assume
that each $X_i$ is an independent normal random variable, with mean $\theta$ and known
variance $\sigma^2$. The hypotheses under consideration are:


    $$H_0 : \theta = 0, \qquad H_1 : \theta \neq 0.$$


A reasonable statistic here is the sample mean $(X_1 + \cdots + X_n)/n$ or its scaled
version

    $$S = \frac{X_1 + \cdots + X_n}{\sigma\sqrt{n}}.$$

A natural choice for the shape of the rejection region is to reject $H_0$ if and only if
$|S| > \xi$. Because $S$ has a standard normal distribution under $H_0$, the value of $\xi$
corresponding to any particular value of $\alpha$ is easily found from the normal tables.
For example, if $\alpha = 0.05$, we use the fact that $P(S \leq 1.96) = 0.975$ to obtain a
rejection region of the form


    reject $H_0$ if $|S| > 1.96$,


or equivalently,

    reject $H_0$ if $|X_1 + \cdots + X_n| > 1.96\,\sigma\sqrt{n}$.


In a one-sided version of this problem, the alternative hypothesis is of the
form $H_1 : \theta > 0$. In this case, the same statistic $S$ can be used, but we will reject
$H_0$ if $S > \xi$, where $\xi$ is chosen so that $P(S > \xi) = \alpha$. Once more, since $S$ has a
standard normal distribution under $H_0$, the value of $\xi$ corresponding to any
particular value of $\alpha$ is easily found from the normal tables.

Finally, if the variance $\sigma^2$ is unknown, we may replace it by an estimate such
as

    $$\hat{S}_n^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \frac{X_1 + \cdots + X_n}{n} \right)^2.$$

In this case, the resulting statistic has a t-distribution (as opposed to normal). If
$n$ is relatively small, the t-tables should be used instead of the normal tables (cf.
Section 9.1).
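
The known-variance test of this example can be sketched in a few lines. The function name and the data below are hypothetical and serve only as an illustration; the normal tables are replaced by the standard-library `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def z_test_zero_mean(xs, sigma, alpha=0.05, one_sided=False):
    """Test H0: theta = 0 for i.i.d. normal data with known std sigma.

    Returns the statistic S = (x_1 + ... + x_n) / (sigma * sqrt(n)),
    the critical value xi, and whether H0 is rejected.
    """
    n = len(xs)
    s = sum(xs) / (sigma * sqrt(n))
    if one_sided:                               # H1: theta > 0
        xi = NormalDist().inv_cdf(1 - alpha)
        return s, xi, s > xi
    xi = NormalDist().inv_cdf(1 - alpha / 2)    # H1: theta != 0
    return s, xi, abs(s) > xi

# Made-up observations, for illustration only.
data = [0.8, -0.3, 1.9, 0.4, 1.1, -0.2, 0.9, 1.4, 0.1]
s, xi, reject = z_test_zero_mean(data, sigma=1.0)
print(f"S = {s:.3f}, xi = {xi:.2f}, reject H0: {reject}")
```

For the unknown-variance case, the same structure applies with $\sigma$ replaced by its estimate and the normal quantile replaced by a t-quantile, which the standard library does not provide.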


Our next example involves a composite null hypothesis $H_0$, in the sense
that there are multiple parameter choices that are compatible with $H_0$.


Example 9.17. Are the Means of Two Populations Equal? We want to
test whether a certain medication is equally effective for two different population
groups. We draw independent samples $X_1, \ldots, X_{n_1}$ and $Y_1, \ldots, Y_{n_2}$ from the two
populations, where $X_i = 1$ (or $Y_i = 1$) if the medication is effective for the $i$th person
in the first (respectively, the second) group, and $X_i = 0$ (or $Y_i = 0$) otherwise.
We view each $X_i$ (or $Y_i$) as a Bernoulli random variable with unknown mean $\theta_X$
(respectively, $\theta_Y$), and we consider the hypotheses


    $$H_0 : \theta_X = \theta_Y, \qquad H_1 : \theta_X \neq \theta_Y.$$

Note that there are multiple pairs $(\theta_X, \theta_Y)$ that are compatible with $H_0$, which
makes $H_0$ a composite hypothesis.

The sample means for the two populations are


    $$\hat{\Theta}_X = \frac{1}{n_1} \sum_{i=1}^{n_1} X_i, \qquad \hat{\Theta}_Y = \frac{1}{n_2} \sum_{i=1}^{n_2} Y_i.$$


A reasonable estimator of $\theta_X - \theta_Y$ is $\hat{\Theta}_X - \hat{\Theta}_Y$. A plausible choice is to reject $H_0$
if and only if

    $$|\hat{\Theta}_X - \hat{\Theta}_Y| > t,$$


for a suitable threshold $t$ to be determined on the basis of the given false rejection
probability $\alpha$. However, an appropriate choice of $t$ is made difficult by the fact that
the distribution of $\hat{\Theta}_X - \hat{\Theta}_Y$ under $H_0$ depends on the unspecified parameters $\theta_X$
and $\theta_Y$. This motivates a somewhat different statistic, as we discuss next.

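For large $n_1$ and $n_2$, the sample means $\hat{\Theta}_X$ and $\hat{\Theta}_Y$ are approximately normal,
and because they are independent, $\hat{\Theta}_X - \hat{\Theta}_Y$ is also approximately normal with mean
$\theta_X - \theta_Y$ and variance


    $$\mathrm{var}(\hat{\Theta}_X - \hat{\Theta}_Y) = \mathrm{var}(\hat{\Theta}_X) + \mathrm{var}(\hat{\Theta}_Y) = \frac{\theta_X(1 - \theta_X)}{n_1} + \frac{\theta_Y(1 - \theta_Y)}{n_2}.$$

Under hypothesis $H_0$, the mean of $\hat{\Theta}_X - \hat{\Theta}_Y$ is known (equal to zero), but its
variance is not, because the common value of $\theta_X$ and $\theta_Y$ is not known. On the
other hand, under $H_0$, the common value of $\theta_X$ and $\theta_Y$ can be estimated by the
overall sample mean

    $$\hat{\Theta} = \frac{\sum_{i=1}^{n_1} X_i + \sum_{i=1}^{n_2} Y_i}{n_1 + n_2},$$

the variance can be approximated by

    $$\hat{\sigma}^2 = \left( \frac{1}{n_1} + \frac{1}{n_2} \right) \hat{\Theta} (1 - \hat{\Theta}),$$

and $(\hat{\Theta}_X - \hat{\Theta}_Y)/\hat{\sigma}$ is approximately a standard normal random variable. This leads
us to consider a rejection region of the form


    reject $H_0$ if $\dfrac{|\hat{\Theta}_X - \hat{\Theta}_Y|}{\hat{\sigma}} > \xi$,


and to choose $\xi$ so that $\Phi(\xi) = 1 - \alpha/2$, where $\Phi$ is the standard normal CDF. For
example, if $\alpha = 0.05$, we obtain a rejection region of the form


    reject $H_0$ if $\dfrac{|\hat{\Theta}_X - \hat{\Theta}_Y|}{\hat{\sigma}} > 1.96$.
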
In a variant of the methodology in this example, we may consider the
hypotheses


    $$H_0 : \theta_X = \theta_Y, \qquad H_1 : \theta_X > \theta_Y,$$

which would be appropriate if we had reason to exclude the possibility $\theta_X < \theta_Y$.
Then, the corresponding rejection region should be one-sided, of the form


    reject $H_0$ if $\dfrac{\hat{\Theta}_X - \hat{\Theta}_Y}{\hat{\sigma}} > \xi$,


where $\xi$ is chosen so that $\Phi(\xi) = 1 - \alpha$.
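
The two-sided version of this test can be sketched as follows. We assume here the pooled variance estimate $\hat{\sigma}^2 = (1/n_1 + 1/n_2)\,\hat{\Theta}(1 - \hat{\Theta})$, with $\hat{\Theta}$ the overall sample mean; the function name and the outcome data are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_population_test(xs, ys, alpha=0.05):
    """Two-sided test of H0: theta_X = theta_Y for 0/1 (Bernoulli) samples."""
    n1, n2 = len(xs), len(ys)
    theta_x = sum(xs) / n1                      # sample mean, first group
    theta_y = sum(ys) / n2                      # sample mean, second group
    theta = (sum(xs) + sum(ys)) / (n1 + n2)     # overall sample mean under H0
    # Pooled standard deviation; zero if all outcomes are identical,
    # in which case the statistic below is undefined.
    sigma = sqrt((1 / n1 + 1 / n2) * theta * (1 - theta))
    stat = (theta_x - theta_y) / sigma
    xi = NormalDist().inv_cdf(1 - alpha / 2)    # Phi(xi) = 1 - alpha/2
    return stat, xi, abs(stat) > xi

# Made-up outcomes: 60 of 100 successes versus 45 of 100.
xs = [1] * 60 + [0] * 40
ys = [1] * 45 + [0] * 55
stat, xi, reject = two_population_test(xs, ys)
print(f"statistic = {stat:.3f}, xi = {xi:.2f}, reject H0: {reject}")
```

For the one-sided variant, one would compare the signed statistic with the quantile satisfying $\Phi(\xi) = 1 - \alpha$ instead.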


The preceding example illustrates a generic issue that arises whenever the
null hypothesis is composite. In order to be able to set the critical value
appropriately, it is preferable to work with a statistic whose approximate distribution
is available and is the same for all parameter values compatible with the null
hypothesis, as was the case for the statistic $(\hat{\Theta}_X - \hat{\Theta}_Y)/\hat{\sigma}$ in Example 9.17.


Generalized Likelihood Ratio and Goodness of Fit Tests 


Our last topic involves testing whether a given PMF conforms with observed
data. This is an important problem, known as testing for goodness of fit. We
will also use it as an introduction to a general methodology for significance testing
in the face of a composite alternative hypothesis.

Consider a random variable that takes values in the finite set $\{1, \ldots, m\}$,
and let $\theta_k$ be the probability of outcome $k$. Thus, the distribution (PMF) of
this random variable is described by the vector parameter $\theta = (\theta_1, \ldots, \theta_m)$. We
consider the hypotheses


    $$H_0 : \theta = (\theta_1^*, \ldots, \theta_m^*), \qquad H_1 : \theta \neq (\theta_1^*, \ldots, \theta_m^*),$$


where the $\theta_k^*$ are given nonnegative numbers that sum to 1. We draw $n$
independent samples of the random variable of interest, and let $N_k$ be the number
of samples that result in outcome $k$. Thus, our observation is $X = (N_1, \ldots, N_m)$
and we denote its realized value by $x = (n_1, \ldots, n_m)$. Note that
$N_1 + \cdots + N_m = n_1 + \cdots + n_m = n$.

As a concrete example, consider $n$ independent rolls of a die and the
hypothesis $H_0$ that the die is fair. In this case, $\theta_k^* = 1/6$, for $k = 1, \ldots, 6$, and
$N_k$ is the number of rolls whose result was equal to $k$. Note that the alternative
hypothesis $H_1$ is composite, as it is compatible with multiple choices of $\theta$.

The approach that we will follow is known as a generalized likelihood
ratio test and involves two steps:


(a) Estimate a model by ML, i.e., determine a parameter vector $\hat{\theta} = (\hat{\theta}_1, \ldots, \hat{\theta}_m)$
that maximizes the likelihood function $p_X(x; \theta)$ over all vectors $\theta$.

(b) Carry out a LRT that compares the likelihood $p_X(x; \theta^*)$ under $H_0$ to the
likelihood $p_X(x; \hat{\theta})$ corresponding to the estimated model. More concretely,
form the generalized likelihood ratio

    $$\frac{p_X(x; \hat{\theta})}{p_X(x; \theta^*)},$$

and if it exceeds a critical value $\xi$, reject $H_0$. As in binary hypothesis
testing, we choose $\xi$ so that the probability of false rejection is (approximately)
equal to a given significance level $\alpha$.


In essence, this approach asks the following question: is there a model
compatible with $H_1$ that provides a better explanation for the observed data than
that provided by the model corresponding to $H_0$? To answer this question, we
compare the likelihood under $H_0$ to the largest possible likelihood under models
compatible with $H_1$.

The first step (ML estimation) involves a maximization over the set of
probability distributions $(\theta_1, \ldots, \theta_m)$. The PMF of the observation vector $X$ is
multinomial (cf. Problem 27 in Chapter 2), and the likelihood function is


    $$p_X(x; \theta) = c\,\theta_1^{n_1} \cdots \theta_m^{n_m},$$


where $c$ is a normalizing constant. It is easier to work with the log-likelihood
function, which takes the form


    $$\log p_X(x; \theta) = \log c + n_1 \log \theta_1 + \cdots + n_{m-1} \log \theta_{m-1} + n_m \log(1 - \theta_1 - \cdots - \theta_{m-1}),$$


where we have also used the fact $\theta_1 + \cdots + \theta_m = 1$ to eliminate $\theta_m$. Assuming
that the vector $\hat{\theta}$ that maximizes the log-likelihood has positive components, it
can be found by setting the derivatives with respect to $\theta_1, \ldots, \theta_{m-1}$ of the above
expression to zero, which yields


    $$\frac{n_k}{\hat{\theta}_k} = \frac{n_m}{1 - \hat{\theta}_1 - \cdots - \hat{\theta}_{m-1}}, \qquad \text{for } k = 1, \ldots, m-1.$$


Since the term on the right-hand side is equal to $n_m/\hat{\theta}_m$, we conclude that all
ratios $n_k/\hat{\theta}_k$ must be equal. Using also the fact $n_1 + \cdots + n_m = n$, it follows that

    $$\hat{\theta}_k = \frac{n_k}{n}, \qquad k = 1, \ldots, m.$$

It can be shown that these are the correct ML estimates even if some of the $n_k$
happen to be zero, in which case the corresponding $\hat{\theta}_k$ are also zero.
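
The claim that the observed frequencies maximize the multinomial likelihood can be checked numerically. The sketch below is only a sanity check, not a proof: it compares the log-likelihood at the frequencies $n_k/n$ against randomly generated PMFs. The counts, function names, and seed are hypothetical, and one count is deliberately zero.

```python
import math
import random

def log_likelihood(counts, theta):
    """Multinomial log-likelihood, up to the additive constant log c."""
    # Convention 0 * log 0 = 0: outcomes with zero count contribute nothing.
    return sum(n_k * math.log(t_k)
               for n_k, t_k in zip(counts, theta) if n_k > 0)

def ml_estimate(counts):
    """Observed frequencies n_k / n."""
    n = sum(counts)
    return [n_k / n for n_k in counts]

counts = [3, 0, 5, 2]                  # made-up counts with n = 10
theta_hat = ml_estimate(counts)
best = log_likelihood(counts, theta_hat)

rng = random.Random(0)
for _ in range(1000):
    w = [rng.random() + 1e-9 for _ in counts]   # random positive weights
    total = sum(w)
    theta = [w_k / total for w_k in w]          # normalize to a PMF
    # No randomly drawn PMF should beat the observed frequencies.
    assert log_likelihood(counts, theta) <= best

print(theta_hat)
```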
The resulting generalized likelihood ratio test is of the form†


    reject $H_0$ if $\dfrac{p_X(x; \hat{\theta})}{p_X(x; \theta^*)} = \prod_{k=1}^{m} \dfrac{(n_k/n)^{n_k}}{(\theta_k^*)^{n_k}} > \xi$,


where $\xi$ is the critical value. By taking logarithms, the test reduces to


    reject $H_0$ if $\sum_{k=1}^{m} n_k \log\left( \dfrac{n_k}{n\theta_k^*} \right) > \log \xi$.


† We adopt the convention that $0^0 = 1$ and $0 \cdot \log 0 = 0$.

We need to determine $\xi$ by taking into account the required significance level,
that is,

    $$P(S > \log \xi;\, H_0) = \alpha,$$

where

    $$S = \sum_{k=1}^{m} N_k \log\left( \frac{N_k}{n\theta_k^*} \right).$$

This may be problematic because the distribution of $S$ under $H_0$ is not readily
available and can only be simulated.

Fortunately, major simplifications are possible when $n$ is large. In this
case, the observed frequencies $\hat{\theta}_k = n_k/n$ will be close to $\theta_k^*$ under $H_0$, with
high probability. Then, a second order Taylor series expansion shows that our
statistic $S$ can be approximated well by $T/2$, where $T$ is given by†


    $$T = \sum_{k=1}^{m} \frac{(N_k - n\theta_k^*)^2}{n\theta_k^*}.$$

Furthermore, when $n$ is large, it is known that under the hypothesis $H_0$, the
distribution of $T$ (and consequently the distribution of $2S$) approaches a so-called
"$\chi^2$ distribution with $m - 1$ degrees of freedom."‡ The CDF of this distribution


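† We note that the second order Taylor series expansion of the function $y \log(y/y^*)$
around any $y^* > 0$ is of the form


    $$y \log\left( \frac{y}{y^*} \right) \approx y - y^* + \frac{1}{2} \frac{(y - y^*)^2}{y^*},$$


and is valid when $y/y^* \approx 1$. Thus,

    $$\sum_{k=1}^{m} N_k \log\left( \frac{N_k}{n\theta_k^*} \right) \approx \sum_{k=1}^{m} (N_k - n\theta_k^*) + \frac{1}{2} \sum_{k=1}^{m} \frac{(N_k - n\theta_k^*)^2}{n\theta_k^*} = \frac{1}{2} \sum_{k=1}^{m} \frac{(N_k - n\theta_k^*)^2}{n\theta_k^*} = \frac{T}{2}.$$
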

‡ The $\chi^2$ distribution with $\ell$ degrees of freedom is defined as the distribution of
the random variable

    $$\sum_{i=1}^{\ell} Z_i^2,$$

where $Z_1, \ldots, Z_\ell$ are independent standard normal random variables (zero mean and
unit variance). Some intuition for why $T$ is approximately $\chi^2$ can be gained from the
fact that as $n \to \infty$, $N_k/n$ not only converges to $\theta_k^*$ but is also asymptotically normal.
Thus, $T$ is equal to the sum of the squares of $m$ zero mean normal random variables,
namely $(N_k - n\theta_k^*)/\sqrt{n\theta_k^*}$. The reason that $T$ has $m - 1$, instead of $m$, degrees of
freedom is related to the fact that $\sum_{k=1}^{m} N_k = n$, so that these $m$ random variables are
actually dependent.

is available in tables, similar to the normal tables. Thus, approximately correct
values of $P(T > \gamma;\, H_0)$ or $P(2S > \gamma;\, H_0)$ can be obtained from the $\chi^2$ tables
and can be used to determine a suitable critical value that corresponds to the
given significance level $\alpha$. Putting everything together, we have the following
test for large values of $n$.


The Chi-Square Test:


• Use the statistic


    $$S = \sum_{k=1}^{m} N_k \log\left( \frac{N_k}{n\theta_k^*} \right)$$


(or possibly the related statistic $T$) and a rejection region of the form


    reject $H_0$ if $2S > \gamma$


(or $T > \gamma$, respectively).


• The critical value $\gamma$ is determined from the CDF tables for the $\chi^2$
distribution with $m - 1$ degrees of freedom so that


    $$P(2S > \gamma;\, H_0) = \alpha,$$


where $\alpha$ is a given significance level.





Example 9.18. Is My Die Fair? A die is rolled independently 600 times and
the number of times that the numbers 1, 2, 3, 4, 5, 6 come up are


    $n_1 = 92, \quad n_2 = 120, \quad n_3 = 88, \quad n_4 = 98, \quad n_5 = 95, \quad n_6 = 107,$


respectively. Let us test the hypothesis $H_0$ that the die is fair by using the
chi-square test based on the statistic $T$, at a level of significance $\alpha = 0.05$. From the
tables for the $\chi^2$ with 5 degrees of freedom, we obtain that for $P(T > \gamma;\, H_0) = 0.05$
we must have $\gamma = 11.1$.

With $\theta_1^* = \cdots = \theta_6^* = 1/6$, $n = 600$, $n\theta_k^* = 100$, and the given values $n_k$, the
value of the statistic $T$ is


    $$\sum_{k=1}^{6} \frac{(n_k - n\theta_k^*)^2}{n\theta_k^*} = \frac{(92 - 100)^2}{100} + \frac{(120 - 100)^2}{100} + \frac{(88 - 100)^2}{100} + \frac{(98 - 100)^2}{100} + \frac{(95 - 100)^2}{100} + \frac{(107 - 100)^2}{100} = 6.86.$$


Since $T = 6.86 < 11.1$, the hypothesis that the die is fair is not rejected. If we use
instead the statistic $S$, then a calculation using the data yields $2S = 6.68$, which
is both close to $T$ and also well below the critical value $\gamma = 11.1$. If the level of
significance were $\alpha = 0.25$, the corresponding value of $\gamma$ would be 6.63. In this
case, the hypothesis that the die is fair would be rejected since $T = 6.86 > 6.63$
and $2S = 6.68 > 6.63$.
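
The two statistics in this example are easy to recompute. The sketch below evaluates $T$ and $2S$ from the observed counts; the function name is hypothetical, and the critical value is simply the table value quoted in the example rather than something computed from the $\chi^2$ CDF.

```python
import math

def chi_square_statistics(counts, theta_star):
    """Return (T, 2S) for observed counts and the null PMF theta_star."""
    n = sum(counts)
    expected = [n * t for t in theta_star]      # n * theta_k^*
    t_stat = sum((n_k - e_k) ** 2 / e_k
                 for n_k, e_k in zip(counts, expected))
    # Convention 0 * log 0 = 0: skip outcomes that were never observed.
    s_stat = sum(n_k * math.log(n_k / e_k)
                 for n_k, e_k in zip(counts, expected) if n_k > 0)
    return t_stat, 2 * s_stat

counts = [92, 120, 88, 98, 95, 107]
T, two_S = chi_square_statistics(counts, [1 / 6] * 6)
gamma = 11.1    # chi-square table value, 5 degrees of freedom, alpha = 0.05
print(f"T = {T:.2f}, 2S = {two_S:.2f}, reject H0: {T > gamma}")
```

The two values should agree with those quoted in the example up to rounding in the last digit, and both fall below the 5% critical value.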


9.5 SUMMARY AND DISCUSSION 


Classical inference methods, in contrast with Bayesian methods, treat $\theta$ as an
unknown constant. Classical parameter estimation aims at estimators with
favorable properties such as a small bias and a satisfactory confidence interval, for
all possible values of $\theta$. We first focused on ML estimation, which is related to
the (Bayesian) MAP method and selects an estimate of $\theta$ that maximizes the
likelihood function given $x$. It is a general estimation method and has several
desirable characteristics, particularly when the number of observations is large.
Then, we discussed the special but practically important case of estimating an 
unknown mean and constructing confidence intervals. Much of the methodology 
here relies on the central limit theorem. We finally discussed the linear regression 
method that aims to match a linear model to the observations in a least squares 
sense. It requires no probabilistic assumptions for its application, but it is also 
related to ML and Bayesian LMS estimation under certain conditions. 

Classical hypothesis testing methods aim at small error probabilities, com- 
bined with simplicity and convenience of calculation. We have focused on tests 
that reject the null hypothesis when the observations fall within a simple type of 
rejection region. The likelihood ratio test is the primary approach for the case of 
two competing simple hypotheses, and derives strong theoretical support from 
the Neyman-Pearson Lemma. We also addressed significance testing, which ap- 
plies when one (or both) of the competing hypotheses is composite. The main 
approach here involves a suitably chosen statistic that summarizes the observa- 
tions, and a rejection region whose probability under the null hypothesis is set 
to a desired significance level. 

In our brief introduction to statistics, we aimed at illustrating the central 
concepts and the most common methodologies, but we have barely touched the 
surface of a very rich subject. For example, we have not discussed important 
topics such as estimation in time-varying environments (time series analysis, and 
filtering), nonparametric estimation (e.g., the problem of estimating an unknown 
PDF on the basis of empirical data), further developments in linear and nonlinear 
regression (e.g., testing whether the assumptions underlying a regression model 
are valid), methods for designing statistical experiments, methods for validating 
the conclusions of a statistical study, computational methods, and many others. 
Yet, we hope to have kindled the reader's interest in the subject and to have
provided some general understanding of the conceptual framework.


PROBLEMS


SECTION 9.1. Classical Parameter Estimation 


Problem 1. Alice models the time that she spends each week on homework as
an exponentially distributed random variable with unknown parameter $\theta$. Homework
times in different weeks are independent. After spending 10, 14, 18, 8, and 20 hours in
the first 5 weeks of the semester, what is her ML estimate of $\theta$?


Problem 2. Consider a sequence of independent coin tosses, and let $\theta$ be the
probability of heads at each toss.


(a) Fix some $k$ and let $N$ be the number of tosses until the $k$th head occurs. Find
the ML estimator of $\theta$ based on $N$.


(b) Fix some $n$ and let $K$ be the number of heads observed in $n$ tosses. Find the ML
estimator of $\theta$ based on $K$.


Problem 3. Sampling and estimation of sums. We have a box with $k$ balls; $\bar{k}$ of
them are white and $k - \bar{k}$ are red. Both $k$ and $\bar{k}$ are assumed known. Each white ball
has a nonzero number on it, and each red ball has zero on it. We want to calculate
the sum of all the ball numbers, but because $k$ is very large, we resort to estimating
it by sampling. This problem aims to quantify the advantages of sampling only white
balls/nonzero numbers and exploiting the knowledge of $\bar{k}$. In particular, we wish to
compare the error variance when we sample $n$ balls with the error variance when we
sample a smaller number $m$ of white balls.


(a) Suppose we draw balls sequentially and independently, according to a uniform
distribution (with replacement). Denote by $X_i$ the number on the $i$th ball drawn,
and by $Y_i$ the number on the $i$th white ball drawn. We fix two positive integers
$n$ and $m$, and denote

    $$\hat{S} = \frac{k}{n} \sum_{i=1}^{n} X_i, \qquad \bar{S} = \frac{\bar{k}}{N} \sum_{i=1}^{N} Y_i, \qquad \tilde{S} = \frac{\bar{k}}{m} \sum_{i=1}^{m} Y_i,$$

where $N$ is the (random) number of white balls drawn in the first $n$ samples. Show
that $\hat{S}$, $\bar{S}$, and $\tilde{S}$ are unbiased estimators of the sum of all the ball numbers.


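(b) Calculate the variances of $\tilde{S}$ and $\hat{S}$, and show that in order for them to be
approximately equal, we must have


    $$m \approx \frac{np}{p + r(1 - p)},$$


where $p = \bar{k}/k$ and $r = E[Y_1^2]/\mathrm{var}(Y_1)$. Show also that when $m = n$,


    $$\frac{\mathrm{var}(\tilde{S})}{\mathrm{var}(\hat{S})} = \frac{p}{p + r(1 - p)}.$$


(c) Calculate the variance of $\bar{S}$, and show that for large $n$,


    $$\frac{\mathrm{var}(\bar{S})}{\mathrm{var}(\hat{S})} \approx \frac{1}{p + r(1 - p)}.$$
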
Problem 4. Mixture models. Let the PDF of a random variable $X$ be the mixture
of $m$ components:


    $$f_X(x) = \sum_{j=1}^{m} p_j f_{Y_j}(x),$$


where

    $$\sum_{j=1}^{m} p_j = 1, \qquad p_j \geq 0, \quad \text{for } j = 1, \ldots, m.$$


Thus, $X$ can be viewed as being generated by a two-step process: first draw $j$ randomly
according to probabilities $p_j$, then draw randomly according to the distribution of $Y_j$.
Assume that each $Y_j$ is normal with mean $\mu_j$ and variance $\sigma_j^2$, and that we have a set
of i.i.d. observations $X_1, \ldots, X_n$, each with PDF $f_X$.


(a) Write down the likelihood and log-likelihood functions.


(b) Consider the case $m = 2$ and $n = 1$, and assume that $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ are
known. Find the ML estimates of $p_1$ and $p_2$.


(c) Consider the case $m = 2$ and $n = 1$, and assume that $p_1$, $p_2$, $\sigma_1$, and $\sigma_2$ are
known. Find the ML estimates of $\mu_1$ and $\mu_2$.


(d) Consider the case $m \geq 2$ and general $n$, and assume that all parameters are
unknown. Show that the likelihood function can be made arbitrarily large by
choosing $\mu_1 = x_1$ and letting $\sigma_1^2$ decrease to zero. Note: This is an example
where the ML approach is problematic.


Problem 5. Unstable particles are emitted from a source and decay at a distance
$X$, which is exponentially distributed with unknown parameter $\theta$. A special device is
used to detect the first $n$ decay events that occur in the interval $[m_1, m_2]$. Suppose
that these events are recorded at distances $X = (X_1, \ldots, X_n)$.


(a) Give the form of the likelihood and log-likelihood functions.


(b) Assume that $m_1 = 1$, $m_2 = 20$, $n = 6$, and $x = (1.5, 2, 3, 4, 5, 12)$. Plot the
likelihood and log-likelihood as functions of $\theta$. Find approximately the ML estimate
of $\theta$ based on your plot.


Problem 6. Consider a study of student heights in a middle school. Assume that the
height of a female student is normally distributed with mean $\mu_1$ and variance $\sigma_1^2$, and
that the height of a male student is normally distributed with mean $\mu_2$ and variance
$\sigma_2^2$. Assume that a student is equally likely to be male or female. A sample of size
$n = 10$ was collected, and the following values were recorded (in centimeters):


    164, 167, 163, 158, 170, 183, 176, 159, 170, 167.

(a) Assume that $\mu_1$, $\mu_2$, $\sigma_1$, and $\sigma_2$ are unknown. Write down the likelihood function.


(b) Assume we know that $\sigma_1^2 = 9$ and $\mu_1 = 164$. Find numerically the ML estimates
of $\sigma_2$ and $\mu_2$.


(c) Assume we know that $\sigma_1^2 = \sigma_2^2 = 9$. Find numerically the ML estimates of $\mu_1$
and $\mu_2$.


(d) Treating the estimates obtained in part (c) as exact values, describe the MAP 
rule for deciding a student's gender based on the student's height. 


Problem 7. Estimating the parameter of a Poisson random variable.
Derive the ML estimator of the parameter of a Poisson random variable based on i.i.d.
observations $X_1, \ldots, X_n$. Is the estimator unbiased and consistent?


Problem 8. Estimating the parameter of a uniform random variable I. We
are given i.i.d. observations $X_1, \ldots, X_n$ that are uniformly distributed over the interval
$[0, \theta]$. What is the ML estimator of $\theta$? Is it consistent? Is it unbiased or asymptotically
unbiased? Can you construct alternative estimators that are unbiased?


Problem 9. Estimating the parameter of a uniform random variable II. We
are given i.i.d. observations $X_1, \ldots, X_n$ that are uniformly distributed over the interval
$[\theta, \theta + 1]$. Find a ML estimator of $\theta$. Is it consistent? Is it unbiased or asymptotically
unbiased?


Problem 10. A source emits a random number of photons $K$ each time that it is
triggered. We assume that the PMF of $K$ is


    $$p_K(k; \theta) = c(\theta)\, e^{-\theta k}, \qquad k = 0, 1, 2, \ldots,$$


where $\theta$ is the inverse of the temperature of the source and $c(\theta)$ is a normalization
factor. We also assume that the photon emissions each time that the source is triggered
are independent. We want to estimate the temperature of the source by triggering it
repeatedly and counting the number of emitted photons.


(a) Determine the normalization factor $c(\theta)$.


(b) Find the expected value and the variance of the number $K$ of photons emitted if
the source is triggered once.


(c) Derive the ML estimator for the temperature $\psi = 1/\theta$, based on $K_1, \ldots, K_n$, the
numbers of photons emitted when the source is triggered $n$ times.


(d) Show that the ML estimator is consistent. 


Problem 11.* Sufficient statistics: factorization criterion. Consider an
observation model of the following type. Assuming for simplicity that all random variables
are discrete, an initial observation $T$ is generated according to a PMF $p_T(t; \theta)$. Having
observed $T$, an additional observation $Y$ is generated according to a conditional PMF
$p_{Y|T}(y \mid t)$ that does not involve the unknown parameter $\theta$. Intuition suggests that out
of the overall observation vector $X = (T, Y)$, only $T$ is useful for estimating $\theta$. This
problem formalizes this idea.

Given observations $X = (X_1, \ldots, X_n)$, we say that a (scalar or vector) function
$q(X)$ is a sufficient statistic for the parameter $\theta$ if the conditional distribution of $X$
given the random variable $T = q(X)$ does not depend on $\theta$, i.e., for every event $D$ and
possible value $t$ of the random variable $T$,


    $$P_\theta(X \in D \mid T = t)$$


is the same for all $\theta$ for which the above conditional probability is well-defined [i.e., for
all $\theta$ for which the PMF $p_T(t; \theta)$ or the PDF $f_T(t; \theta)$ is positive]. Assume that either
$X$ is discrete (in which case, $T$ is also discrete), or that both $X$ and $T$ are continuous
random variables.


(a) Show that $T = q(X)$ is a sufficient statistic for $\theta$ if and only if it satisfies the
following factorization criterion: the likelihood function $p_X(x; \theta)$ (discrete
case) or $f_X(x; \theta)$ (continuous case) can be written as $r(q(x), \theta)\,s(x)$ for some
functions $r$ and $s$.


(b) Show that if $q(X)$ is a sufficient statistic for $\theta$, then for any function $h$ of $\theta$, $q(X)$
is a sufficient statistic for the parameter $\zeta = h(\theta)$.


(c) Show that if $q(X)$ is a sufficient statistic for $\theta$, a ML estimate of $\theta$ can be written
as $\hat{\Theta}_n = \phi(q(X))$ for some function $\phi$. Note: This supports the idea that a
sufficient statistic captures all essential information about $\theta$ provided by $X$.


Solution. (a) We consider only the discrete case; the proof for the continuous case is
similar. Assume that the likelihood function can be written as $r(q(x), \theta)\,s(x)$. We will
show that $T = q(X)$ is a sufficient statistic.

Fix some $t$ and consider some $\theta$ for which $P_\theta(T = t) > 0$. For any $x$ for which
$q(x) \neq t$, we have $P_\theta(X = x \mid T = t) = 0$, which is trivially the same for all $\theta$. Consider
now any $x$ for which $q(x) = t$. Using the fact $P_\theta(X = x, T = t) = P_\theta(X = x, q(X) =
q(x)) = P_\theta(X = x)$, we have


    $$P_\theta(X = x \mid T = t) = \frac{P_\theta(X = x, T = t)}{P_\theta(T = t)} = \frac{P_\theta(X = x)}{P_\theta(T = t)}$$

    $$= \frac{r(t, \theta)\,s(x)}{\sum_{\{z \mid q(z) = t\}} r(q(z), \theta)\,s(z)} = \frac{r(t, \theta)\,s(x)}{r(t, \theta) \sum_{\{z \mid q(z) = t\}} s(z)}$$

    $$= \frac{s(x)}{\sum_{\{z \mid q(z) = t\}} s(z)},$$


so $P_\theta(X = x \mid T = t)$ does not depend on $\theta$. This implies that for any event $D$, the
conditional probability $P_\theta(X \in D \mid T = t)$ is the same for all $\theta$ for which $P_\theta(T = t) > 0$,
so $T$ is a sufficient statistic.

Conversely, assume that $T = q(X)$ is a sufficient statistic. For any $x$ with
$p_X(x; \theta) > 0$, the likelihood function is


    $$p_X(x; \theta) = P_\theta\big(X = x \mid q(X) = q(x)\big)\, P_\theta\big(q(X) = q(x)\big).$$


Since $T$ is a sufficient statistic, the first term on the right-hand side does not depend
on $\theta$, and is of the form $s(x)$. The second term depends on $x$ through $q(x)$, and is of
the form $r(q(x), \theta)$. This establishes that the likelihood function can be factored as
claimed.


(b) This is evident from the definition of a sufficient statistic, since for $\zeta = h(\theta)$, we
have

    $$P_\zeta(X \in D \mid T = t) = P_\theta(X \in D \mid T = t),$$


so $P_\zeta(X \in D \mid T = t)$ is the same for all $\zeta$.


(c) By part (a), the likelihood function can be factored as $r(q(x), \theta)\,s(x)$. Thus, a ML
estimate maximizes $r(q(x), \theta)$ over $\theta$ [if $s(x) > 0$] or minimizes $r(q(x), \theta)$ over $\theta$ [if
$s(x) < 0$], and therefore depends on $x$ only through $q(x)$.


Problem 12.* Examples of a sufficient statistic I. Show that q(X) = Aii. Xi 
is a sufficient statistic in the following cases: 

(a) X1,..., Xn are i.i.d. Bernoulli random variables with parameter 0. 

(b) Xi,..., X4 are i.i.d. Poisson random variables with parameter 0. 


Solution. (a) The likelihood function is 
px(z;0) = PO — 9)", 


so it can be factored as the product of the function 69") (1 — 8)"79*(?, which depends 
on z only through q(z), and the constant function s(x) = 1. The result follows from 
the factorization criterion for a sufficient statistic. 


(b) The likelihood function is

$$p_X(x;\theta) = \prod_{i=1}^n p_{X_i}(x_i;\theta) = e^{-n\theta}\prod_{i=1}^n \frac{\theta^{x_i}}{x_i!} = e^{-n\theta}\,\theta^{q(x)} \cdot \frac{1}{\prod_{i=1}^n x_i!},$$

so it can be factored as the product of the function $e^{-n\theta}\theta^{q(x)}$, which depends on $x$ only through $q(x)$, and the function $s(x) = 1/\prod_{i=1}^n x_i!$, which depends only on $x$. The result follows from the factorization criterion for a sufficient statistic.
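
The Bernoulli case can also be checked numerically against the definition of a sufficient statistic. The following is a minimal sketch, not part of the original solution: the helper name `conditional_given_sum`, the sample vector, and the values of $\theta$ are illustrative assumptions. It enumerates all binary vectors of length $n$ and computes $P_\theta(X = x \mid \sum_i X_i = t)$ directly, which should come out the same for every $\theta$.

```python
from itertools import product
from math import comb

def conditional_given_sum(x, theta):
    """P_theta(X = x | sum(X) = sum(x)) for i.i.d. Bernoulli(theta), by enumeration."""
    n, t = len(x), sum(x)

    def pmf(z):
        # Joint PMF of a binary vector z under parameter theta.
        k = sum(z)
        return theta**k * (1 - theta) ** (n - k)

    # P_theta(T = t): sum the joint PMF over all vectors with the same sum.
    total = sum(pmf(z) for z in product((0, 1), repeat=n) if sum(z) == t)
    return pmf(x) / total

x = (1, 0, 1, 0)  # hypothetical observation vector
probs = [conditional_given_sum(x, theta) for theta in (0.2, 0.5, 0.9)]
print(probs, 1 / comb(len(x), sum(x)))
```

Each conditional probability agrees with $1/\binom{n}{t}$ up to rounding, whatever the value of $\theta$, as the factorization criterion predicts.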


Problem 13.* Examples of a sufficient statistic II. Let $X_1,\ldots,X_n$ be i.i.d. normal random variables with mean $\mu$ and variance $\sigma^2$. Show that:
(a) If $\sigma^2$ is known, $q(X) = \sum_{i=1}^n X_i$ is a sufficient statistic for $\mu$.
(b) If $\mu$ is known, $q(X) = \sum_{i=1}^n (X_i - \mu)^2$ is a sufficient statistic for $\sigma^2$.
(c) If both $\mu$ and $\sigma^2$ are unknown, $q(X) = \big(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2\big)$ is a sufficient statistic for $(\mu, \sigma^2)$.
Solution. Use the calculations in Example 9.4, and the factorization criterion.


Problem 14.* Rao-Blackwell theorem. This problem shows that a general estimator can be modified into one that only depends on a sufficient statistic, without loss of performance. Given observations $X = (X_1,\ldots,X_n)$, let $T = q(X)$ be a sufficient statistic for the parameter $\theta$, and let $g(X)$ be an estimator for $\theta$.

(a) Show that $E_\theta[g(X) \mid T]$ is the same for all values of $\theta$. Thus, we can suppress the subscript $\theta$, and view

$$\hat g(X) = E[g(X) \mid T]$$

as a new estimator of $\theta$, which depends on $X$ only through $T$.

(b) Show that the estimators $g(X)$ and $\hat g(X)$ have the same bias.

(c) Show that for any $\theta$ with $\mathrm{var}_\theta\big(g(X)\big) < \infty$,

$$E_\theta\big[(\hat g(X) - \theta)^2\big] \le E_\theta\big[(g(X) - \theta)^2\big].$$

Furthermore, for a given $\theta$, this inequality is strict if and only if

$$E_\theta\big[\mathrm{var}(g(X) \mid T)\big] > 0.$$


Solution. (a) Since $T = q(X)$ is a sufficient statistic, the conditional distribution $P_\theta(X = x \mid T = t)$ does not depend on $\theta$, so the same is true for $E_\theta[g(X) \mid T]$.

(b) We have by the law of iterated expectations

$$E_\theta[g(X)] = E_\theta\big[E[g(X) \mid T]\big] = E_\theta[\hat g(X)],$$

so the biases of $g(X)$ and $\hat g(X)$ are equal.


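(c) Fix some $\theta$ and let $b_\theta$ denote the common bias of $g(X)$ and $\hat g(X)$. We have, using the law of total variance,

$$
\begin{aligned}
E_\theta\big[(g(X) - \theta)^2\big] &= \mathrm{var}_\theta\big(g(X)\big) + b_\theta^2 \\
&= E_\theta\big[\mathrm{var}(g(X) \mid T)\big] + \mathrm{var}_\theta\big(E[g(X) \mid T]\big) + b_\theta^2 \\
&= E_\theta\big[\mathrm{var}(g(X) \mid T)\big] + \mathrm{var}_\theta\big(\hat g(X)\big) + b_\theta^2 \\
&= E_\theta\big[\mathrm{var}(g(X) \mid T)\big] + E_\theta\big[(\hat g(X) - \theta)^2\big] \\
&\ge E_\theta\big[(\hat g(X) - \theta)^2\big],
\end{aligned}
$$

with the inequality being strict if and only if $E_\theta\big[\mathrm{var}(g(X) \mid T)\big] > 0$.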

Problem 15.* Let $X_1,\ldots,X_n$ be i.i.d. random variables that are uniformly distributed over the interval $[0, \theta]$.

(a) Show that $T = \max_{i=1,\ldots,n} X_i$ is a sufficient statistic.

(b) Show that $g(X) = (2/n)\sum_{i=1}^n X_i$ is an unbiased estimator.

(c) Find the form of the estimator $\hat g(X) = E[g(X) \mid T]$, and then calculate and compare $E_\theta\big[(\hat g(X) - \theta)^2\big]$ with $E_\theta\big[(g(X) - \theta)^2\big]$.

Solution. (a) The likelihood function is

$$f_X(x_1,\ldots,x_n;\theta) = f_{X_1}(x_1;\theta)\cdots f_{X_n}(x_n;\theta) = \begin{cases} 1/\theta^n, & \text{if } 0 \le \max_{i=1,\ldots,n} x_i \le \theta, \\ 0, & \text{otherwise,} \end{cases}$$

and depends on $x$ only through $q(x) = \max_{i=1,\ldots,n} x_i$. The result follows from the factorization criterion for a sufficient statistic.


(b) We have

$$E_\theta[g(X)] = \frac{2}{n}\sum_{i=1}^n E_\theta[X_i] = \frac{2}{n}\cdot\frac{n\theta}{2} = \theta.$$


(c) Conditioned on the event $\{T = t\}$, one of the observations $X_i$ is equal to $t$. The remaining $n - 1$ observations are uniformly distributed over the interval $[0, t]$, and have a conditional expectation equal to $t/2$. Thus,

$$E[g(X) \mid T = t] = \frac{2}{n}\,E\Big[\sum_{i=1}^n X_i \,\Big|\, T = t\Big] = \frac{2}{n}\Big(t + \frac{(n-1)t}{2}\Big) = \frac{n+1}{n}\,t,$$

and $\hat g(X) = E[g(X) \mid T] = (n+1)T/n$.

We will now calculate the mean squared error of the two estimators $\hat g(X)$ and $g(X)$, as functions of $\theta$. To this end, we evaluate the first and second moment of $\hat g(X)$. We have

$$E_\theta[\hat g(X)] = E_\theta\big[E[g(X) \mid T]\big] = E_\theta[g(X)] = \theta.$$


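To find the second moment, we first determine the PDF of $T$. For $t \in [0,\theta]$, we have $P_\theta(T \le t) = (t/\theta)^n$, and by differentiating, $f_T(t;\theta) = n t^{n-1}/\theta^n$. Thus,

$$
\begin{aligned}
E_\theta\big[(\hat g(X))^2\big] &= \Big(\frac{n+1}{n}\Big)^2 E_\theta[T^2] = \Big(\frac{n+1}{n}\Big)^2 \int_0^\theta t^2 f_T(t;\theta)\,dt \\
&= \Big(\frac{n+1}{n}\Big)^2 \int_0^\theta \frac{n\,t^{n+1}}{\theta^n}\,dt = \frac{(n+1)^2}{n(n+2)}\,\theta^2.
\end{aligned}
$$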


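Since $\hat g(X)$ has mean $\theta$, its mean squared error is equal to its variance, and

$$E_\theta\big[(\hat g(X) - \theta)^2\big] = E_\theta\big[(\hat g(X))^2\big] - \theta^2 = \frac{(n+1)^2}{n(n+2)}\,\theta^2 - \theta^2 = \frac{1}{n(n+2)}\,\theta^2.$$

Similarly, the mean squared error of $g(X)$ is equal to its variance, and

$$E_\theta\big[(g(X) - \theta)^2\big] = \frac{4}{n^2}\sum_{i=1}^n \mathrm{var}_\theta(X_i) = \frac{4}{n^2}\cdot n\cdot\frac{\theta^2}{12} = \frac{1}{3n}\,\theta^2.$$

It can be seen that $\frac{1}{3n} \ge \frac{1}{n(n+2)}$ for all positive integers $n$. It follows that

$$E_\theta\big[(\hat g(X) - \theta)^2\big] \le E_\theta\big[(g(X) - \theta)^2\big],$$

which is consistent with the Rao-Blackwell theorem.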
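
The two mean squared errors derived above can be compared by simulation. The sketch below is illustrative only: the function name `simulate_mse`, the values $\theta = 2$ and $n = 5$, the number of trials, and the seed are assumptions, not part of the original solution. It draws repeated samples from the uniform distribution on $[0,\theta]$ and averages the squared errors of $g(X) = (2/n)\sum_i X_i$ and $\hat g(X) = (n+1)T/n$.

```python
import random

def simulate_mse(theta=2.0, n=5, trials=100_000, seed=0):
    """Monte Carlo estimates of the MSE of g(X) and of its Rao-Blackwellized version."""
    rng = random.Random(seed)
    se_g = se_g_hat = 0.0
    for _ in range(trials):
        xs = [rng.uniform(0.0, theta) for _ in range(n)]
        g = 2.0 * sum(xs) / n          # twice the sample mean
        g_hat = (n + 1) * max(xs) / n  # E[g(X) | T], with T = max of the sample
        se_g += (g - theta) ** 2
        se_g_hat += (g_hat - theta) ** 2
    return se_g / trials, se_g_hat / trials

mse_g, mse_g_hat = simulate_mse()
theory_g = 2.0**2 / (3 * 5)            # theta^2 / (3n)
theory_g_hat = 2.0**2 / (5 * (5 + 2))  # theta^2 / (n(n+2))
print(mse_g, theory_g)
print(mse_g_hat, theory_g_hat)
```

With these settings the simulated values should land close to $\theta^2/(3n)$ and $\theta^2/(n(n+2))$ respectively, with the second visibly smaller.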

SECTION 9.2. Linear Regression


Problem 16. An electric utility company tries to estimate the relation between the 
daily amount of electricity used by customers and the daily summer temperature. It 
has collected the data shown on the table below. 


Temperature   96     89     81     86     83
Electricity   23.67  20.45  21.86  23.28  20.71

Temperature   73     78     74     76     78
Electricity   18.21  18.85  20.10  18.48  17.94





(a) Set up and estimate the parameters of a linear model that can be used to predict 
electricity consumption as a function of temperature. 


(b) If the temperature on a given day is 90 degrees, predict the amount of electricity 
consumed on that day. 
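
As a hedged illustration of how the linear regression formulas of Section 9.2 apply to the table above, the sketch below computes $\hat\theta_1 = \sum_i (x_i - \bar x)(y_i - \bar y)/\sum_i (x_i - \bar x)^2$ and $\hat\theta_0 = \bar y - \hat\theta_1 \bar x$. The names `fit_line`, `temps`, and `usage` are assumptions made for the example, and the data are copied from the table as printed.

```python
temps = [96, 89, 81, 86, 83, 73, 78, 74, 76, 78]
usage = [23.67, 20.45, 21.86, 23.28, 20.71, 18.21, 18.85, 20.10, 18.48, 17.94]

def fit_line(xs, ys):
    """Linear regression formulas: returns (theta0_hat, theta1_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sxy / sxx
    theta0 = y_bar - theta1 * x_bar
    return theta0, theta1

theta0, theta1 = fit_line(temps, usage)
prediction_at_90 = theta0 + theta1 * 90  # predicted consumption at 90 degrees
print(theta0, theta1, prediction_at_90)
```

The same function can be reused for any other set of data pairs; only the two input lists change.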


Problem 17. Given the five data pairs $(x_i, y_i)$ in the table below,

x   0.798    2.546    9.005     7.261     9.131
y   -2.373   20.906   103.544   215.775   333.911

we want to construct a model relating $x$ and $y$. We consider a linear model

$$Y_i = \theta_0 + \theta_1 x_i + W_i, \qquad i = 1,\ldots,5,$$

and a quadratic model

$$Y_i = \beta_0 + \beta_1 x_i^2 + V_i, \qquad i = 1,\ldots,5,$$

where $W_i$ and $V_i$ represent additive noise terms, modeled by independent normal random variables with mean zero and variance $\sigma_1^2$ and $\sigma_2^2$, respectively.


(a) Find the ML estimates of the linear model parameters. 
(b) Find the ML estimates of the quadratic model parameters. 


(c) Assume that the two estimated models are equally likely to be true, and that the noise terms $W_i$ and $V_i$ have the same variance: $\sigma_1^2 = \sigma_2^2$. Use the MAP rule to choose between the two models.


Problem 18.* Unbiasedness and consistency in linear regression. In a probabilistic framework for regression, let us assume that $Y_i = \theta_0 + \theta_1 x_i + W_i$, $i = 1,\ldots,n$, where $W_1,\ldots,W_n$ are i.i.d. normal random variables with mean zero and variance $\sigma^2$. Then, given $x_i$ and the realized values $y_i$ of $Y_i$, $i = 1,\ldots,n$, the ML estimates of $\theta_0$ and $\theta_1$ are given by the linear regression formulas, as discussed in Section 9.2.

(a) Show that the ML estimators $\hat\Theta_0$ and $\hat\Theta_1$ are unbiased.

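(b) Show that the variances of the estimators $\hat\Theta_0$ and $\hat\Theta_1$ are

$$\mathrm{var}(\hat\Theta_0) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n \sum_{i=1}^n (x_i - \bar x)^2}, \qquad \mathrm{var}(\hat\Theta_1) = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2},$$

respectively, and their covariance is

$$\mathrm{cov}(\hat\Theta_0, \hat\Theta_1) = -\frac{\sigma^2\,\bar x}{\sum_{i=1}^n (x_i - \bar x)^2}.$$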


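(c) Show that if $\sum_{i=1}^n (x_i - \bar x)^2 \to \infty$ and $\bar x^2$ is bounded by a constant as $n \to \infty$, we have $\mathrm{var}(\hat\Theta_0) \to 0$ and $\mathrm{var}(\hat\Theta_1) \to 0$. (This, together with Chebyshev's inequality, implies that the estimators $\hat\Theta_0$ and $\hat\Theta_1$ are consistent.)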


Note: Although the assumption that the $W_i$ are normal is needed for our estimators to be ML estimators, the argument below shows that these estimators remain unbiased and consistent without this assumption.


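Solution. (a) Let the true values of $\theta_0$ and $\theta_1$ be $\theta_0^*$ and $\theta_1^*$, respectively. We have

$$\hat\Theta_1 = \frac{\sum_{i=1}^n (x_i - \bar x)(Y_i - \bar Y)}{\sum_{i=1}^n (x_i - \bar x)^2}, \qquad \hat\Theta_0 = \bar Y - \hat\Theta_1 \bar x,$$

where $\bar Y = \big(\sum_{i=1}^n Y_i\big)/n$, and where we treat $x_1,\ldots,x_n$ as constant. Denoting $\bar W = \big(\sum_{i=1}^n W_i\big)/n$, we have

$$Y_i = \theta_0^* + \theta_1^* x_i + W_i, \qquad \bar Y = \theta_0^* + \theta_1^* \bar x + \bar W,$$

and

$$Y_i - \bar Y = \theta_1^*(x_i - \bar x) + (W_i - \bar W).$$

Thus,

$$
\begin{aligned}
\hat\Theta_1 &= \frac{\sum_{i=1}^n (x_i - \bar x)\big(\theta_1^*(x_i - \bar x) + W_i - \bar W\big)}{\sum_{i=1}^n (x_i - \bar x)^2} = \theta_1^* + \frac{\sum_{i=1}^n (x_i - \bar x)(W_i - \bar W)}{\sum_{i=1}^n (x_i - \bar x)^2} \\
&= \theta_1^* + \frac{\sum_{i=1}^n (x_i - \bar x)W_i}{\sum_{i=1}^n (x_i - \bar x)^2},
\end{aligned}
$$

where we have used the fact $\sum_{i=1}^n (x_i - \bar x) = 0$. Since $E[W_i] = 0$, it follows that

$$E[\hat\Theta_1] = \theta_1^*.$$

Also

$$\hat\Theta_0 = \bar Y - \hat\Theta_1 \bar x = \theta_0^* + \theta_1^* \bar x + \bar W - \hat\Theta_1 \bar x = \theta_0^* + (\theta_1^* - \hat\Theta_1)\bar x + \bar W,$$

and using the facts $E[\hat\Theta_1] = \theta_1^*$ and $E[\bar W] = 0$, we obtain

$$E[\hat\Theta_0] = \theta_0^*.$$

Thus, the estimators $\hat\Theta_0$ and $\hat\Theta_1$ are unbiased.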


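(b) We now calculate the variance of the estimators. Using the formula for $\hat\Theta_1$ derived in part (a) and the independence of the $W_i$, we have

$$\mathrm{var}(\hat\Theta_1) = \frac{\sum_{i=1}^n (x_i - \bar x)^2\,\mathrm{var}(W_i)}{\Big(\sum_{i=1}^n (x_i - \bar x)^2\Big)^2} = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$

Similarly, using the formula for $\hat\Theta_0$ derived in part (a),

$$\mathrm{var}(\hat\Theta_0) = \mathrm{var}(\bar W - \hat\Theta_1 \bar x) = \mathrm{var}(\bar W) + \bar x^2\,\mathrm{var}(\hat\Theta_1) - 2\bar x\,\mathrm{cov}(\bar W, \hat\Theta_1).$$

Since $\sum_{i=1}^n (x_i - \bar x) = 0$ and $E[\bar W W_i] = \sigma^2/n$ for all $i$, we obtain

$$\mathrm{cov}(\bar W, \hat\Theta_1) = \frac{E\Big[\bar W \sum_{i=1}^n (x_i - \bar x)W_i\Big]}{\sum_{i=1}^n (x_i - \bar x)^2} = \frac{\dfrac{\sigma^2}{n}\sum_{i=1}^n (x_i - \bar x)}{\sum_{i=1}^n (x_i - \bar x)^2} = 0.$$

Combining the last three equations, we obtain

$$\mathrm{var}(\hat\Theta_0) = \mathrm{var}(\bar W) + \bar x^2\,\mathrm{var}(\hat\Theta_1) = \frac{\sigma^2}{n} + \frac{\bar x^2 \sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} = \sigma^2\,\frac{\sum_{i=1}^n (x_i - \bar x)^2 + n\bar x^2}{n\sum_{i=1}^n (x_i - \bar x)^2}.$$

By expanding the quadratic forms $(x_i - \bar x)^2$, we also have

$$\sum_{i=1}^n (x_i - \bar x)^2 + n\bar x^2 = \sum_{i=1}^n x_i^2.$$

By combining the preceding two equations,

$$\mathrm{var}(\hat\Theta_0) = \frac{\sigma^2 \sum_{i=1}^n x_i^2}{n\sum_{i=1}^n (x_i - \bar x)^2}.$$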

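We finally calculate the covariance of $\hat\Theta_0$ and $\hat\Theta_1$. We have

$$\mathrm{cov}(\hat\Theta_0, \hat\Theta_1) = E\big[(\hat\Theta_0 - \theta_0^*)(\hat\Theta_1 - \theta_1^*)\big] = E\Big[\big((\theta_1^* - \hat\Theta_1)\bar x + \bar W\big)(\hat\Theta_1 - \theta_1^*)\Big],$$

or

$$\mathrm{cov}(\hat\Theta_0, \hat\Theta_1) = -\bar x\,\mathrm{var}(\hat\Theta_1) + \mathrm{cov}(\bar W, \hat\Theta_1).$$

Since, as shown earlier, $\mathrm{cov}(\bar W, \hat\Theta_1) = 0$, we finally obtain

$$\mathrm{cov}(\hat\Theta_0, \hat\Theta_1) = -\frac{\bar x\,\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2}.$$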


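(c) If $\sum_{i=1}^n (x_i - \bar x)^2 \to \infty$, the expression for $\mathrm{var}(\hat\Theta_1)$ derived in part (b) goes to zero. Then, the formula

$$\mathrm{var}(\hat\Theta_0) = \mathrm{var}(\bar W) + \bar x^2\,\mathrm{var}(\hat\Theta_1),$$

from part (b), together with the fact $\mathrm{var}(\bar W) = \sigma^2/n \to 0$ and the assumption that $\bar x^2$ is bounded by a constant, implies that $\mathrm{var}(\hat\Theta_0) \to 0$.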
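
The unbiasedness and the variance and covariance formulas of this problem can be sanity-checked by simulation. The following is a minimal sketch under stated assumptions: the true parameters `THETA0`, `THETA1`, `SIGMA`, the design points `XS`, the number of trials, and the seed are all hypothetical choices made only for illustration.

```python
import random

def fit_line(xs, ys):
    """Linear regression formulas: returns (theta0_hat, theta1_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    theta1 = sxy / sxx
    return y_bar - theta1 * x_bar, theta1

def mean(v):
    return sum(v) / len(v)

def cov(u, v):
    """Empirical covariance; cov(u, u) is the empirical variance."""
    mu, mv = mean(u), mean(v)
    return sum((a - mu) * (b - mv) for a, b in zip(u, v)) / len(u)

# Hypothetical true parameters and design points, chosen only for illustration.
THETA0, THETA1, SIGMA = 1.0, 2.0, 0.5
XS = [0.0, 1.0, 2.0, 3.0, 4.0]

rng = random.Random(1)
est0, est1 = [], []
for _ in range(20_000):
    ys = [THETA0 + THETA1 * x + rng.gauss(0.0, SIGMA) for x in XS]
    t0, t1 = fit_line(XS, ys)
    est0.append(t0)
    est1.append(t1)

n = len(XS)
x_bar = mean(XS)
sxx = sum((x - x_bar) ** 2 for x in XS)
theory = {
    "var0": SIGMA**2 * sum(x * x for x in XS) / (n * sxx),
    "var1": SIGMA**2 / sxx,
    "cov01": -x_bar * SIGMA**2 / sxx,
}
empirical = {
    "mean0": mean(est0),
    "mean1": mean(est1),
    "var0": cov(est0, est0),
    "var1": cov(est1, est1),
    "cov01": cov(est0, est1),
}
print(theory)
print(empirical)
```

The empirical means should sit near the true parameters, and the empirical second moments near the values given by the formulas of part (b).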


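Problem 19.* Variance estimate in linear regression. Under the same assumptions as in the preceding problem, show that

$$\hat S_n^2 = \frac{1}{n-2}\sum_{i=1}^n (Y_i - \hat\Theta_0 - \hat\Theta_1 x_i)^2$$

is an unbiased estimator of $\sigma^2$.

Solution. Let $\hat V_n = \sum_{i=1}^n (Y_i - \hat\Theta_0 - \hat\Theta_1 x_i)^2$. Using the formula $\hat\Theta_0 = \bar Y - \hat\Theta_1 \bar x$ and the expression for $\hat\Theta_1$, we have

$$
\begin{aligned}
\hat V_n &= \sum_{i=1}^n \big(Y_i - \bar Y - \hat\Theta_1 (x_i - \bar x)\big)^2 \\
&= \sum_{i=1}^n (Y_i - \bar Y)^2 - 2\hat\Theta_1 \sum_{i=1}^n (Y_i - \bar Y)(x_i - \bar x) + \hat\Theta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 \\
&= \sum_{i=1}^n (Y_i - \bar Y)^2 - \hat\Theta_1^2 \sum_{i=1}^n (x_i - \bar x)^2 \\
&= \sum_{i=1}^n Y_i^2 - n\bar Y^2 - \hat\Theta_1^2 \sum_{i=1}^n (x_i - \bar x)^2.
\end{aligned}
$$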




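Taking expectation of both sides, we obtain

$$E[\hat V_n] = \sum_{i=1}^n E[Y_i^2] - nE[\bar Y^2] - \sum_{i=1}^n (x_i - \bar x)^2\,E[\hat\Theta_1^2].$$

We also have

$$
\begin{aligned}
E[Y_i^2] &= \mathrm{var}(Y_i) + \big(E[Y_i]\big)^2 = \sigma^2 + (\theta_0^* + \theta_1^* x_i)^2, \\
E[\bar Y^2] &= \mathrm{var}(\bar Y) + \big(E[\bar Y]\big)^2 = \frac{\sigma^2}{n} + (\theta_0^* + \theta_1^* \bar x)^2, \\
E[\hat\Theta_1^2] &= \mathrm{var}(\hat\Theta_1) + \big(E[\hat\Theta_1]\big)^2 = \frac{\sigma^2}{\sum_{i=1}^n (x_i - \bar x)^2} + (\theta_1^*)^2.
\end{aligned}
$$

Combining the last four equations and simplifying, we obtain

$$E[\hat V_n] = (n - 2)\sigma^2.$$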
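
The role of the divisor $n - 2$ can be seen in a short simulation. This is a sketch under assumed settings (the true parameters, the design points `XS`, the number of trials, and the seed are hypothetical): it averages the residual sum of squares over many simulated data sets and compares dividing by $n - 2$ with dividing by $n$.

```python
import random

def residual_sum_of_squares(xs, ys):
    """Sum of squared residuals after fitting the linear regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    theta1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    theta0 = y_bar - theta1 * x_bar
    return sum((y - theta0 - theta1 * x) ** 2 for x, y in zip(xs, ys))

# Hypothetical true parameters and design points, chosen only for illustration.
THETA0, THETA1, SIGMA = 1.0, 2.0, 0.5
XS = [0.0, 1.0, 2.0, 3.0, 4.0]

rng = random.Random(2)
trials = 20_000
total = 0.0
for _ in range(trials):
    ys = [THETA0 + THETA1 * x + rng.gauss(0.0, SIGMA) for x in XS]
    total += residual_sum_of_squares(XS, ys)

n = len(XS)
mean_v = total / trials               # estimates E[V_n] = (n - 2) * sigma^2
unbiased_estimate = mean_v / (n - 2)  # should be close to sigma^2
naive_estimate = mean_v / n           # biased low by the factor (n - 2) / n
print(unbiased_estimate, naive_estimate, SIGMA**2)
```

With a small $n$ such as the one assumed here, the gap between the two divisors is large, which makes the bias of the naive version easy to see.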


SECTION 9.3. Binary Hypothesis Testing 


Problem 20. A random variable $X$ is characterized by a normal PDF with mean $\mu_0 = 20$, and a variance that is either $\sigma_0^2 = 16$ (hypothesis $H_0$) or $\sigma_1^2 = 25$ (hypothesis $H_1$). We want to test $H_0$ against $H_1$, using three sample values $x_1, x_2, x_3$, and a rejection region of the form

$$R = \{x \mid x_1 + x_2 + x_3 > \gamma\}$$

for some scalar $\gamma$. Determine the value of $\gamma$ so that the probability of false rejection is 0.05. What is the corresponding probability of false acceptance?


Problem 21. A normal random variable $X$ is known to have a mean of 60 and a standard deviation equal to 5 (hypothesis $H_0$) or 8 (hypothesis $H_1$).

(a) Consider a hypothesis test using a single sample $x$. Let the rejection region be of the form

$$R = \{x \mid |x - 60| > \gamma\}$$

for some scalar $\gamma$. Determine $\gamma$ so that the probability of false rejection of $H_0$ is 0.1. What is the corresponding false acceptance probability? Would the rejection region change if we were to use the LRT with the same false rejection probability?

(b) Consider a hypothesis test using $n$ independent samples $x_1,\ldots,x_n$. Let the rejection region be of the form

$$R = \Big\{(x_1,\ldots,x_n) \,\Big|\, \Big|\frac{x_1 + \cdots + x_n}{n} - 60\Big| > \gamma\Big\},$$

where $\gamma$ is chosen so that the probability of false rejection of $H_0$ is 0.1. How does the false acceptance probability change with $n$? What can you conclude about the appropriateness of this type of test?

(c) Derive the structure of the LRT using $n$ independent samples $x_1,\ldots,x_n$.


Problem 22. There are two hypotheses about the probability of heads for a given coin: $\theta = 0.5$ (hypothesis $H_0$) and $\theta = 0.6$ (hypothesis $H_1$). Let $X$ be the number of heads obtained in $n$ tosses, where $n$ is large enough so that normal approximations are appropriate. We test $H_0$ against $H_1$ by rejecting $H_0$ if $X$ is greater than some suitably chosen threshold $k_n$.

(a) What should be the value of $k_n$ so that the probability of false rejection is less than or equal to 0.05?


(b) What is the smallest value of n for which both probabilities of false rejection and 
false acceptance can be made less than or equal to 0.05? 


(c) For the value of n found in part (b), what would be the probability of false 
acceptance if we were to use a LRT with the same probability of false rejection? 


Problem 23. The number of phone calls received by a ticket agency on any one day is Poisson distributed. On an ordinary day, the expected value of the number of calls is $\lambda_0$, and on a day where there is a popular show in town, the expected value of the number of calls is $\lambda_1$, with $\lambda_1 > \lambda_0$. Describe the LRT for deciding whether there is a popular show in town based on the number of calls received. Assume a given probability of false rejection, and find an expression for the critical value $\xi$.


Problem 24. We have received a shipment of light bulbs whose lifetimes are modeled as independent, exponentially distributed random variables, with parameter equal to $\lambda_0$ (hypothesis $H_0$) or equal to $\lambda_1$ (hypothesis $H_1$). We measure the lifetimes of $n$ light bulbs. Describe the LRT for selecting one of the two hypotheses. Assume a given probability of false rejection of $H_0$ and give an analytical expression for the critical value $\xi$.


SECTION 9.4. Significance Testing 


Problem 25. Let $X$ be a normal random variable with mean $\mu$ and unit variance. We want to test the hypothesis $\mu = 5$ at the 5% level of significance, using $n$ independent samples of $X$.

(a) What is the range of values of the sample mean for which the hypothesis is accepted?

(b) Let $n = 10$. Calculate the probability of accepting the hypothesis $\mu = 5$ when the true value of $\mu$ is 4.

Problem 26. We have five observations drawn independently from a normal distribution with unknown mean $\mu$ and unknown variance $\sigma^2$.

(a) Estimate $\mu$ and $\sigma^2$ if the observation values are 8.47, 10.91, 10.87, 9.46, 10.40.

(b) Use the t-distribution tables to test the hypothesis $\mu = 9$ at the 95% significance level, using the estimates of part (a).

Problem 27. A plant grows on two distant islands. Suppose that its life span (measured in days) on the first (or the second) island is normally distributed with unknown mean $\mu_X$ (or $\mu_Y$) and known variance $\sigma_X^2 = 32$ (or $\sigma_Y^2 = 29$, respectively). We wish to test the hypothesis $\mu_X = \mu_Y$, based on 10 independent samples from each island. The corresponding sample means are $\bar x = 181$ and $\bar y = 177$. Do the data support the hypothesis at the 95% significance level?


Problem 28. A company considers buying a machine to manufacture a certain item. When tested, 28 out of 600 items produced by the machine were found defective. Do the data support the hypothesis that the defect rate of the machine is smaller than 3 percent, at the 5% significance level?


Problem 29. The values of five independent samples of a Poisson random variable turned out to be 34, 35, 29, 31, and 30. Test the hypothesis that the mean is equal to 35 at the 5% level of significance.


Problem 30. A surveillance camera periodically checks a certain area and records a signal $X = W$ if there is no intruder (this is the null hypothesis $H_0$). If there is an intruder the signal is $X = \theta + W$, where $\theta$ is unknown with $\theta > 0$. We assume that $W$ is a normal random variable with mean 0 and known variance $v = 0.5$.


(a) We obtain a single signal value X = 0.96. Should Ho be rejected at the 5% level 
of significance? 

(b) We obtain five independent signal values X = 0.96, —0.34, 0.85, 0.51, —0.24. Should 
Ho be rejected at the 5% level of significance? 


(c) Repeat part (b), using the t-distribution, and assuming the variance $v$ is unknown.

Markov property, 340
Matchbox problem, 120 
Matched filter, 428 
Matrix 
doubly stochastic, 391 
transition, 341, 345 
Maximum a Posteriori, see MAP 
Maximum likelihood, 459, 462 
asymptotic normality, 465 
consistency, 465 
in regression, 478 
invariance principle, 465 
Maximum of r.v.'s, 150 
Mean, see expectation 
Mean first passage time, 367, 398 
Mean recurrence time, 367, 399 
Mean squared error, 430, 431 
conditional, 425, 430 
Measurement, 412 
Merging 
Bernoulli processes, 305 
Poisson processes, 319 
Minimum of r.v.'s, 124 
Mixed r.v., 186 
Mixture of distributions, 236 
ML, see maximum likelihood 
Model inference, 409 
Moment, 83, 145 
calculation using transforms, 232 
Moment generating function, 229, 232 
see also transform 
Monotone convergence theorem, 293 
Monotonic function of a r.v., 207 
Monty Hall problem, 27 
Multicollinearity, 484 
Multiple regression, 482 
Multinomial 
coefficient, 49 
distribution, 124, 503 
Multiplication rule 
for PDFs, 172, 194 
for PMFs, 101, 103, 131 
for probabilities, 24 
Multivariate transform, 240 
Mutual information, 137 
Mutually exclusive, 7 


N 


Negative binomial r.v., 327 
Neyman-Pearson lemma, 491 
Nonlinear regression, 483 
Nonnegativity axiom, 9 
Nonparametric problem, 411
Normal approximation, 275 
Normal r.v. 
bivariate, 254 
CDF, 154 
central limit theorem, 273 
confidence interval, 469 
estimation, 415, 421, 449, 464 
independent, 176 
linear function of, 154, 206 
mean, 153 
model, 443 
normalization property, 153, 189 
standard, 154 
sum of, 214, 238 
table, 155 
transform, 231, 239 
uncorrelated, 254 
variance, 153 
Normalization axiom, 9 
Normalization equation, in Markov chains, 354 
Null hypothesis, 485, 496 


O


Observation, 412 
Ordered pair, 5 
Outcome, 6 

mutually exclusive, 7 
Overfitting, 484 


P 


p-value, 499 
Pairwise independence, 39 
Paradox 
Bertrand's, 16 
of induction, 58 
random incidence, 322 
St. Petersburg, 123 
two-envelopes, 106 
Parallel connection, 41 
Parametric model, 496 
Partition, 4, 49 
Pascal, 17, 62, 65 
Pascal r.v., 303, 327 
Pascal triangle, 65 
PDF, 140 
conditional, 165, 167, 169, 172 
joint, 158 
marginal, 159 
of function of r.v.'s, see derived 
distributions 
of linear function of r.v.'s, 205 
of monotonic function of r.v.'s, 207 
properties, 144
spherically invariant, 451 
Periodic class, 351, 383 
Permutation, 46, 69 
k-permutation, 46 
Perron-Frobenius theorem, 354 
Piecewise constant PDF, 142 
mean, 174 
variance, 174 
PMF, 74 
calculation of, 75 
conditional, 98, 100 
summary, 103 
joint, 92 
marginal, 93 
Point estimate, 422 
Poisson, 17 
Poisson process, 297, 309 
alternative description, 316 
arrival rate, 310 
definition, 310 
independence, 310, 314 
intensity, 310 
interarrival times, 315 
memorylessness, 314 
merging, 319 
number of arrivals, 311, 335 
random incidence, 322 
small interval probabilities, 310 
splitting, 318, 336 
time-homogeneity, 310 
time until first arrival, 312 
time until kth arrival, 316 
Poisson random variable, 78, 117, 312 
approximation of binomial, 79, 121 
mean, 90 
parameter estimation, 449 
splitting, 132 
sum of, 237, 247, 313 
transform, 230, 239 
variance, 113 
Polling, 270, 276, 474 
Posterior, 32, 409, 412 
Prior, 32, 409 
flat, 463 
Prisoner's dilemma, 58 
Probabilistic model, 6 
Probability 
conditional, 18 
history, 17 
steady-state, see steady-state probability 
subjective, 3 
Probability density function, see PDF 
Probability mass function, see PMF 
Probability law, 6, 8
of random variable, 148 
properties, 14 
Problem of joint lives, 128 
Problem of points, 62 
Process 
arrival, 296 
Bernoulli, 297 
birth-death, 359, 377 
Erlang, 337 
Markov, 297 
Poisson, 297, 309 


Q 


Queueing, 336, 348, 365, 373, 392 
Quiz problem, 91, 125 


R 


Random incidence 
in Bernoulli process, 327 
in Erlang process, 337 
in Poisson process, 322 
in non-Poisson process, 323 
Random variable 
continuous, 140 
definition, 73 
discrete, 73, 74 
function of, 73, 74, 80, 93, 202 
independent, 110, 175, 177 
mixed, 186 
Random walk, 360 
Rao-Blackwell theorem, 511 
Rayleigh r.v., 248 
Recurrence time, 367, 399 
Recurrent 
class, 348 
aperiodic, 350 
multiple, 398 
periodic, 350 
state, 347, 382 
existence, 381 
Recursive inference, 416 
Regression 
Bayesian, 480, 481 
causality, 485 
consistency, 514 
formulas, 477 
heteroskedasticity, 484 
linear, 459, 475 
multicollinearity, 484 
nonlinear, 483 
nonlinearity, 484 
overfitting, 484 




multiple, 482 
simple, 482 
unbiasedness, 514 
variance estimate, 518 
Rejection region, 486, 497, 498 
Relative frequency, 9, 17, see also frequency 
interpretation 
Reliability, 40, 61 
Residual, 476 
Reversibility, 394 


S 


Sample mean, 114, 264, 466 
mean, 114, 264
unbiased, 466 
variance, 114, 264 

Sample space, 6 
uncountable, 7 

Schwartz inequality, 250 

Sequential model, 8, 22 

Sequential method, 52 

Series connection, 41 

Set, 3 
complement, 4 
countably infinite, 3 
disjoint, 4 
element of, 3 
empty, 3 
intersection, 4 
partition, 4 
uncountable, 4 
union, 4 
universal, 4 

Shannon, 136 

Signal detection, see detection 

Significance level, 497, 498 

Significance testing, 459, 460, 495 

Simulation, 115, 188, 190, 196, 489, 499 

Spherically invariant PDF, 451 

Splitting 
Bernoulli process, 305 
Poisson r.v., 132 
Poisson process, 318, 336 

St. Petersburg paradox, 123 

Standard deviation, 83, 86 

Standard normal random variable, 154 

Standard normal table, 155 
use for calculating normal CDF, 156 

State 
absorbing, 346, 362 
accessible, 347 
classification, 346 
of Markov chain, 340 
recurrent, 347, 381, 382
transient, 347, 381 
State space, 340 
Stationary distribution, 353 
Statistic, 497 
sufficient, 509, 511 
Statistics, 408 
Bayesian, 408, 411, 412 
classical, 408, 459 
frequentist, 408 
nonparametric, 411 
Steady-state, in Markov chains, 352, 375 
Steady-state convergence theorem, 352, 375, 
387 
Steady-state probability, 352, 375 
Stochastic process, 296 
Strong law of large numbers, 280 
for Markov chains, 402 
proof, 294 
Subset, 4 
Sufficient statistic, 509, 511 
Sum of random number of r.v.'s, 240 
geometric number of exponential r.v.'s, 243, 
335 
geometric number of geometric r.v.'s, 244, 
328 
mean, 197, 241 
variance, 197, 241 
transform, 241 
Sum of random variables 
of Bernoulli, 261 
convolution, 213 
expectation, see expectation, linearity 
of normal, 214, 238 
of Poisson, 237, 313 
transform, 237 
variability extremes, 135 
variance, 111, 112, 113, 178, 220 


T 


t-distribution, 471, 500 
tables, 473 
Tabular method, 93, 94 
Total expectation theorem
for continuous r.v.'s, 173 
for discrete r.v.'s, 104 
Total probability theorem, 28 
conditional version, 59 
for PDFs, 167 
for PMFs, 103 
Transient state, 347, 381 
Transition 
graph, 341 
matrix, 341, 345
probability, 340
rate, 371
Transform, 229
inversion property, 234, 240
multivariate, 240
of linear function, 231
of mixture, 236
of sum of r.v.'s, 237
of sum of random number of r.v.'s, 241 
table, 239 
Trial 
Bernoulli, 41 
independent, 41 
Two-envelopes paradox, 106 
Two-envelopes puzzle, 58 
Type I error, 486 
Type II error, 486 


U 


Unbiased, 436, 461 
asymptotically, 461 
Uncountable, 4, 7, 13, 53 
Uncorrelated, 218, 225, 254, 436 
Uniform r.v. 
continuous, 141, 182 
mean, 145 
relation to Bernoulli process, 328 
transform, 239, 258 
variance, 145 
discrete, 116
mean, 88 
transform, 239, 258 
variance, 88 
two-dimensional continuous, 159 


V


Value of a r.v., 72 
Variability extremes, 135 
Variable inference, 410 
Variance, 83, 86, 145 
conditional, 104, 226 
estimation, 468 
law of total, 226 
of linear function of r.v.'s, 87, 145 
of product of r.v.'s, 192, 252 
of sum of r.v.'s, 111, 112, 113, 178, 220 
in terms of moments, 87, 145 
Venn diagram, 5, 15 


W


Weak law of large numbers, 269 


Z 


z-transform, 230 


Summary of Results for Special Discrete Random Variables 


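Discrete Uniform over $[a,b]$:

$$p_X(k) = \begin{cases} \dfrac{1}{b-a+1}, & \text{if } k = a, a+1, \ldots, b, \\ 0, & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{a+b}{2}, \qquad \mathrm{var}(X) = \frac{(b-a)(b-a+2)}{12}, \qquad M_X(s) = \frac{e^{sa}\big(e^{s(b-a+1)} - 1\big)}{(b-a+1)(e^{s} - 1)}.$$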


Bernoulli with Parameter $p$: (Describes the success or failure in a single trial.)

$$p_X(k) = \begin{cases} p, & \text{if } k = 1, \\ 1 - p, & \text{if } k = 0. \end{cases}$$

$$E[X] = p, \qquad \mathrm{var}(X) = p(1-p), \qquad M_X(s) = 1 - p + pe^{s}.$$


Binomial with Parameters $p$ and $n$: (Describes the number of successes in $n$ independent Bernoulli trials.)

$$p_X(k) = \binom{n}{k} p^k (1-p)^{n-k}, \qquad k = 0, 1, \ldots, n.$$

$$E[X] = np, \qquad \mathrm{var}(X) = np(1-p), \qquad M_X(s) = (1 - p + pe^{s})^n.$$


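Geometric with Parameter $p$: (Describes the number of trials until the first success, in a sequence of independent Bernoulli trials.)

$$p_X(k) = (1-p)^{k-1}p, \qquad k = 1, 2, \ldots,$$

$$E[X] = \frac{1}{p}, \qquad \mathrm{var}(X) = \frac{1-p}{p^2}, \qquad M_X(s) = \frac{pe^{s}}{1 - (1-p)e^{s}}.$$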


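Poisson with Parameter $\lambda$: (Approximates the binomial PMF when $n$ is large, $p$ is small, and $\lambda = np$.)

$$p_X(k) = e^{-\lambda}\frac{\lambda^k}{k!}, \qquad k = 0, 1, \ldots,$$

$$E[X] = \lambda, \qquad \mathrm{var}(X) = \lambda, \qquad M_X(s) = e^{\lambda(e^{s} - 1)}.$$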





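Summary of Results for Special Continuous Random Variables


Continuous Uniform Over $[a,b]$:

$$f_X(x) = \begin{cases} \dfrac{1}{b-a}, & \text{if } a \le x \le b, \\ 0, & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{a+b}{2}, \qquad \mathrm{var}(X) = \frac{(b-a)^2}{12}, \qquad M_X(s) = \frac{e^{sb} - e^{sa}}{s(b-a)}.$$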


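Exponential with Parameter $\lambda$:

$$f_X(x) = \begin{cases} \lambda e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases} \qquad F_X(x) = \begin{cases} 1 - e^{-\lambda x}, & \text{if } x \ge 0, \\ 0, & \text{otherwise,} \end{cases}$$

$$E[X] = \frac{1}{\lambda}, \qquad \mathrm{var}(X) = \frac{1}{\lambda^2}, \qquad M_X(s) = \frac{\lambda}{\lambda - s}, \quad (s < \lambda).$$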


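Normal with Parameters $\mu$ and $\sigma^2 > 0$:

$$f_X(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-(x-\mu)^2/2\sigma^2},$$

$$E[X] = \mu, \qquad \mathrm{var}(X) = \sigma^2, \qquad M_X(s) = e^{(\sigma^2 s^2/2) + \mu s}.$$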








The standard normal table. The entries in this table provide the numerical values of $\Phi(y) = P(Y \le y)$, where $Y$ is a standard normal random variable, for $y$ between 0 and 3.49.

Example of use: To find $\Phi(1.71)$, we look at the row corresponding to 1.7 and the column corresponding to 0.01, so that $\Phi(1.71) = .9564$. When $y$ is negative, the value of $\Phi(y)$ can be found using the formula $\Phi(y) = 1 - \Phi(-y)$.
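
When the table is not at hand, the same values can be computed from the error function, using the identity $\Phi(y) = \tfrac{1}{2}\big(1 + \operatorname{erf}(y/\sqrt{2})\big)$. The sketch below is an illustration rather than part of the original text; the function name `std_normal_cdf` is an assumption.

```python
from math import erf, sqrt

def std_normal_cdf(y):
    """Phi(y) = P(Y <= y) for a standard normal Y, via the error function."""
    return 0.5 * (1.0 + erf(y / sqrt(2.0)))

print(round(std_normal_cdf(1.71), 4))   # table lookup: row 1.7, column 0.01
print(round(std_normal_cdf(-1.71), 4))  # agrees with 1 - Phi(1.71)
```

Rounded to four decimals, the first value matches the table entry quoted in the example of use, and the second follows the symmetry formula for negative arguments.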